A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use the sales forecast to estimate weekly, monthly, quarterly, and annual sales totals. An accurate sales forecast adds value across an organization and helps its different business verticals plan their future course of action.
Forecasting helps an organization plan its sales operations by region and provides valuable insights to the supply chain team regarding the procurement of goods and materials. An accurate sales forecasting process has many benefits, including improved decision-making and reduced sales-pipeline and forecast risk. It also cuts the time spent planning territory coverage and establishes benchmarks for assessing future trends.
SuperKart is a retail chain operating supermarkets and food marts in cities of various tiers, offering a wide range of products. To optimize its inventory management and make informed decisions around regional sales strategies, SuperKart wants to accurately forecast the sales revenue of its outlets for the upcoming quarter.
To operationalize these insights at scale, the company has partnered with a data science firm—not just to build a predictive model based on historical sales data, but to develop and deploy a robust forecasting solution that can be integrated into SuperKart’s decision-making systems and used across its network of stores.
The data contains different attributes of the various products and stores. The detailed data dictionary is given below.
ML Task Type: Regression - predicting sales
Target Variable: Product_Store_Sales_Total - the revenue generated by each product in each store
Problem Understanding:
Likely Success Metrics:
Key Insight: Despite the business context mentioning "forecasting for upcoming quarter", the data structure suggests this is a cross-sectional sales prediction problem rather than time series forecasting, since there are no temporal features provided.
🎯 Objective
Predict the Product_Store_Sales_Total (target variable), enabling better inventory management and regional sales planning for the upcoming quarter.
⚠️ NOTE:
Despite the term "forecast" being used in the problem statement, the data and setup suggest:
🔍 It's a standard regression task, not a time-based forecast.
Here's why:
No date, period, or sequence columns are provided, and the target (Product_Store_Sales_Total) looks like an aggregate, not a point in time. So:
✅ While "forecast" is used for business context, technically it's a supervised regression prediction task.
That distinction is important because it affects how we validate (a random train/test split rather than a time-based split), which features we engineer, and which metrics we report.
Hence, our problem context: predict sales for any product-store combination based on meaningful business characteristics.
# Essential libraries for data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tabulate as tb
import pprint as pp
from scipy import stats
from scipy.stats import skew, zscore, uniform, randint
from datetime import datetime
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    OneHotEncoder, StandardScaler,
    OrdinalEncoder, PowerTransformer, FunctionTransformer
)
from sklearn.linear_model import LinearRegression, Ridge
#! pip install catboost
from catboost import CatBoostRegressor
import joblib
from pathlib import Path
import os
from io import StringIO
import sys
from huggingface_hub import login, HfApi
import requests
import json
import warnings
warnings.filterwarnings("ignore")
from IPython.display import HTML  # for displaying images in notebook
# Configure display and plotting
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
plt.style.use('default')
sns.set_palette("husl")
#sns.set_palette('mako')
plt.rcParams["figure.figsize"] = (12, 6)
💡 The code below is just a version check (we faced a version-compatibility issue earlier)
# print pandas version
print(pd.__version__)
2.3.0
import importlib.metadata
# pandas raised a warning about this dependency, so verify its installed version
print(importlib.metadata.version("bottleneck"))
1.5.0
def format_pct(val):
"""Format a val as percentage i.e max 2 decimal value & adding % at the end"""
return f"{val:.2f}%"
def to_pct(value):
"""value is expected to be a normalized float value in [0, 1]"""
return format_pct(value * 100)
def ratio_to_pct(a, b):
"""
Convert a ratio to a percentage
"""
return to_pct(a / b)
def table_it(data, title=None, tablefmt='simple', floatfmt='.2f', orientation='vertical', show_index=False):
"""
Display single dictionary or pandas Series input as a formatted table
Parameters:
-----------
data : dict or pandas.Series
The data to display in table format
title : str, optional
Optional title for the table
tablefmt : str, default 'simple'
Table format style (simple, grid, fancy_grid, etc.)
floatfmt : str, default '.2f'
Float formatting for numerical values
orientation : str, default 'vertical'
'vertical' for rows, 'horizontal' for columns
show_index : bool, default False
Whether to show row numbers/index in the table
Returns:
--------
None - prints formatted table
"""
if title:
print(f"\n{title}")
print("-" * len(title))
# Handle pandas Series
if isinstance(data, pd.Series):
if orientation == 'vertical':
table_data = data.to_frame()
print(tb.tabulate(table_data, headers='keys', tablefmt=tablefmt,
floatfmt=floatfmt, showindex=show_index))
else:
# Horizontal format (with .T)
table_data = data.to_frame().T
print(tb.tabulate(table_data, headers='keys', tablefmt=tablefmt,
floatfmt=floatfmt, showindex=show_index))
# Handle dictionary
elif isinstance(data, dict):
if orientation == 'vertical':
if show_index:
# Add row numbers when show_index=True
table_data = [[i, key, value] for i, (key, value) in enumerate(data.items(), 1)]
headers = ['#', 'Key', 'Value']
else:
# Standard format without index
table_data = [[key, value] for key, value in data.items()]
headers = ['Key', 'Value']
print(tb.tabulate(table_data, headers=headers, tablefmt=tablefmt, floatfmt=floatfmt))
else:
# Horizontal format for dictionary
keys = list(data.keys())
values = list(data.values())
table_data = [values]
if show_index:
# For horizontal, we can add index as first row
index_row = list(range(1, len(keys) + 1))
table_data.insert(0, index_row)
headers = ['#'] + keys
else:
headers = keys
print(tb.tabulate(table_data, headers=headers, tablefmt=tablefmt, floatfmt=floatfmt))
# Handle other iterables (lists, etc.)
elif hasattr(data, '__iter__') and not isinstance(data, str):
if show_index:
# Index is already built-in for this case
table_data = [[i, value] for i, value in enumerate(data, 1)]
headers = ['#', 'Value']
else:
# Use enumerate starting from 1, but don't show explicit index column
table_data = [[value] for value in data]
headers = ['Value']
print(tb.tabulate(table_data, headers=headers, tablefmt=tablefmt, floatfmt=floatfmt))
else:
print(f"Unsupported data type: {type(data)}")
print("Supported types: pandas.Series, dict, or iterable")
def tb_describe(df_col):
"""
Helper function to display descriptive statistics in a nicely formatted table
Parameters:
df_col : pandas Series or DataFrame column
The column to generate descriptive statistics for
Returns:
None - prints formatted table
"""
stats = df_col.describe().to_frame().T
print(tb.tabulate(stats, headers='keys', tablefmt='simple', floatfmt='.2f'))
def classify_column(col: pd.Series) -> str:
"""
Classify a column into one of the following categories:
- ID or Unique Identifier
- Categorical
- Ordinal or Discrete
- Continuous Numerical
- Binary
- Other / Needs Manual Check
"""
if col.nunique() == len(col):
return "ID or Unique Identifier"
elif col.dtype == 'object' or col.dtype.name == 'category':
return "Categorical"
elif col.dtype in ['int64', 'float64']:
if col.nunique() < 10:
return "Ordinal or Discrete (Numerical)"
else:
return "Continuous Numerical"
elif pd.api.types.is_bool_dtype(col):
return "Binary"
else:
return "Other / Needs Manual Check"
df = pd.read_csv('SuperKart.csv')
# backup of original df
df_original = df.copy()
df.head()
| | Product_Id | Product_Weight | Product_Sugar_Content | Product_Allocated_Area | Product_Type | Product_MRP | Store_Id | Store_Establishment_Year | Store_Size | Store_Location_City_Type | Store_Type | Product_Store_Sales_Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FD6114 | 12.66 | Low Sugar | 0.027 | Frozen Foods | 117.08 | OUT004 | 2009 | Medium | Tier 2 | Supermarket Type2 | 2842.40 |
| 1 | FD7839 | 16.54 | Low Sugar | 0.144 | Dairy | 171.43 | OUT003 | 1999 | Medium | Tier 1 | Departmental Store | 4830.02 |
| 2 | FD5075 | 14.28 | Regular | 0.031 | Canned | 162.08 | OUT001 | 1987 | High | Tier 2 | Supermarket Type1 | 4130.16 |
| 3 | FD8233 | 12.10 | Low Sugar | 0.112 | Baking Goods | 186.31 | OUT001 | 1987 | High | Tier 2 | Supermarket Type1 | 4132.18 |
| 4 | NC1180 | 9.57 | No Sugar | 0.010 | Health and Hygiene | 123.67 | OUT002 | 1998 | Small | Tier 3 | Food Mart | 2279.36 |
def overview_data(df):
"""
Comprehensive data overview function
Args:
df: pandas DataFrame
Returns:
dict: Overview statistics for further use
"""
print("🔍 COMPREHENSIVE DATA OVERVIEW")
print("=" * 50)
print(f"💾 Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print()
# 1. Basic Shape & Structure
r, c = df.shape
print(f"📊 Dataset Shape: {r:,} rows × {c} columns")
print(f"🔄 Duplicate Rows: {df.duplicated().sum()}")
# 2. Missing Values Analysis
print("\n❓ MISSING VALUES ANALYSIS")
print("-" * 30)
# Per-column counts (the previous version broadcast one total to every row)
missing_counts = df.isnull().sum()
missing_data = pd.DataFrame({
'Column': df.columns,
'Missing_Count': missing_counts.values,
'Missing_Percentage': [ratio_to_pct(c, len(df)) for c in missing_counts.values]
})
missing_data = missing_data[missing_data['Missing_Count'] > 0]
if len(missing_data) == 0:
print("✅ No missing values found!")
else:
print(missing_data.to_string(index=False))
# 3. Data Types Summary
print("\n📋 DATA TYPES SUMMARY")
print("-" * 30)
dtype_summary = df.dtypes.value_counts()
for dtype, count in dtype_summary.items():
print(f"{dtype}: {count} columns")
# 4. Identify Numerical vs Categorical Columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print(f"\n🔢 NUMERICAL COLUMNS ({len(numerical_cols)}):")
print(f" {numerical_cols}")
print(f"\n🏷️ CATEGORICAL COLUMNS ({len(categorical_cols)}):")
print(f" {categorical_cols}")
# 5. Numerical Columns Statistics
if numerical_cols:
print("\n📈 NUMERICAL COLUMNS STATISTICS")
print("-" * 40)
numerical_stats = df[numerical_cols].describe().T
print(numerical_stats.round(2))
# 6. Categorical Columns Statistics
if categorical_cols:
print("\n📊 CATEGORICAL COLUMNS STATISTICS")
print("-" * 40)
for col in categorical_cols:
unique_count = df[col].nunique()
most_frequent = df[col].mode()[0]
most_frequent_count = df[col].value_counts().iloc[0]
print(f"{col}:")
print(f" - Unique Values: {unique_count}")
print(f" - Most Frequent: '{most_frequent}' ({most_frequent_count:,} times)")
print(f" - Sample Values: {list(df[col].unique()[:5])}")
print()
# 7. Target Variable Quick Stats (assuming last column or specify)
target_col = df.columns[-1] # Assuming target is last column
if df[target_col].dtype in ['int64', 'float64']:
print(f"🎯 TARGET VARIABLE: {target_col}")
print("-" * 30)
print(f"Min: {df[target_col].min():.2f}")
print(f"Max: {df[target_col].max():.2f}")
print(f"Mean: {df[target_col].mean():.2f}")
print(f"Median: {df[target_col].median():.2f}")
print(f"Std Dev: {df[target_col].std():.2f}")
# Return summary for further use
overview_stats = {
'shape': df.shape,
'missing_data': missing_data,
'numerical_cols': numerical_cols,
'categorical_cols': categorical_cols,
'memory_usage_mb': df.memory_usage(deep=True).sum() / 1024**2,
'duplicates': df.duplicated().sum()
}
return overview_stats
# Usage
overview = overview_data(df)
🔍 COMPREHENSIVE DATA OVERVIEW
==================================================
💾 Memory Usage: 4.16 MB
📊 Dataset Shape: 8,763 rows × 12 columns
🔄 Duplicate Rows: 0
❓ MISSING VALUES ANALYSIS
------------------------------
✅ No missing values found!
📋 DATA TYPES SUMMARY
------------------------------
object: 7 columns
float64: 4 columns
int64: 1 columns
🔢 NUMERICAL COLUMNS (5):
['Product_Weight', 'Product_Allocated_Area', 'Product_MRP', 'Store_Establishment_Year', 'Product_Store_Sales_Total']
🏷️ CATEGORICAL COLUMNS (7):
['Product_Id', 'Product_Sugar_Content', 'Product_Type', 'Store_Id', 'Store_Size', 'Store_Location_City_Type', 'Store_Type']
📈 NUMERICAL COLUMNS STATISTICS
----------------------------------------
                            count     mean      std     min      25%      50%      75%     max
Product_Weight             8763.0    12.65     2.22     4.0    11.15    12.66    14.18    22.0
Product_Allocated_Area     8763.0     0.07     0.05     0.0     0.03     0.06     0.10     0.3
Product_MRP                8763.0   147.03    30.69    31.0   126.16   146.74   167.58   266.0
Store_Establishment_Year   8763.0  2002.03     8.39  1987.0  1998.00  2009.00  2009.00  2009.0
Product_Store_Sales_Total  8763.0  3464.00  1065.63    33.0  2761.72  3452.34  4145.16  8000.0
📊 CATEGORICAL COLUMNS STATISTICS
----------------------------------------
Product_Id:
- Unique Values: 8763
- Most Frequent: 'DR1005' (1 times)
- Sample Values: ['FD6114', 'FD7839', 'FD5075', 'FD8233', 'NC1180']
Product_Sugar_Content:
- Unique Values: 4
- Most Frequent: 'Low Sugar' (4,885 times)
- Sample Values: ['Low Sugar', 'Regular', 'No Sugar', 'reg']
Product_Type:
- Unique Values: 16
- Most Frequent: 'Fruits and Vegetables' (1,249 times)
- Sample Values: ['Frozen Foods', 'Dairy', 'Canned', 'Baking Goods', 'Health and Hygiene']
Store_Id:
- Unique Values: 4
- Most Frequent: 'OUT004' (4,676 times)
- Sample Values: ['OUT004', 'OUT003', 'OUT001', 'OUT002']
Store_Size:
- Unique Values: 3
- Most Frequent: 'Medium' (6,025 times)
- Sample Values: ['Medium', 'High', 'Small']
Store_Location_City_Type:
- Unique Values: 3
- Most Frequent: 'Tier 2' (6,262 times)
- Sample Values: ['Tier 2', 'Tier 1', 'Tier 3']
Store_Type:
- Unique Values: 4
- Most Frequent: 'Supermarket Type2' (4,676 times)
- Sample Values: ['Supermarket Type2', 'Departmental Store', 'Supermarket Type1', 'Food Mart']
🎯 TARGET VARIABLE: Product_Store_Sales_Total
------------------------------
Min: 33.00
Max: 8000.00
Mean: 3464.00
Median: 3452.34
Std Dev: 1065.63
print("Columns Summary:")
summary = pd.DataFrame({
"Column": df.columns,
"Dtype": df.dtypes.values,
"Missing": df.isnull().sum().values,
"Unique": df.nunique().values
})
# Columns Summary
summary
Columns Summary:
| | Column | Dtype | Missing | Unique |
|---|---|---|---|---|
| 0 | Product_Id | object | 0 | 8763 |
| 1 | Product_Weight | float64 | 0 | 1113 |
| 2 | Product_Sugar_Content | object | 0 | 4 |
| 3 | Product_Allocated_Area | float64 | 0 | 228 |
| 4 | Product_Type | object | 0 | 16 |
| 5 | Product_MRP | float64 | 0 | 6100 |
| 6 | Store_Id | object | 0 | 4 |
| 7 | Store_Establishment_Year | int64 | 0 | 4 |
| 8 | Store_Size | object | 0 | 3 |
| 9 | Store_Location_City_Type | object | 0 | 3 |
| 10 | Store_Type | object | 0 | 4 |
| 11 | Product_Store_Sales_Total | float64 | 0 | 8668 |
🔍 Observation
Product_Id and Store_Id are nominal identifiers → can be ignored for modeling
Store_Establishment_Year can be treated as categorical even though its dtype is numeric; alternatively, a new feature such as store age can be derived to represent it more naturally as a number
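A sketch of the derived-age idea on a toy frame; the reference year 2010 is an assumption (any fixed year at or after the latest establishment year works):

```python
import pandas as pd

# Hypothetical mini-frame using the same column name as SuperKart.csv
stores = pd.DataFrame({"Store_Establishment_Year": [1987, 1998, 1999, 2009]})

# Derive a store "age" feature; 2010 is an assumed reference year
REFERENCE_YEAR = 2010
stores["Store_Age"] = REFERENCE_YEAR - stores["Store_Establishment_Year"]
print(stores["Store_Age"].tolist())  # [23, 12, 11, 1]
```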
# Target variable stats summary
print("Target variable distribution:")
df['Product_Store_Sales_Total'].describe()
Target variable distribution:
count    8763.000000
mean     3464.003640
std      1065.630494
min        33.000000
25%      2761.715000
50%      3452.340000
75%      4145.165000
max      8000.000000
Name: Product_Store_Sales_Total, dtype: float64
🎯 TARGET VARIABLE INSIGHTS:
# ===========================
# 2. FEATURE CATEGORIZATION
# ===========================
# Define feature categories for systematic analysis
NUMERICAL_FEATURES = [
'Product_Weight',
'Product_Allocated_Area',
'Product_MRP',
'Store_Establishment_Year'
]
CATEGORICAL_FEATURES = [
'Product_Sugar_Content',
'Product_Type',
'Store_Size',
'Store_Location_City_Type',
'Store_Type'
]
TARGET_VARIABLE = 'Product_Store_Sales_Total'
print("🏷️ FEATURE CATEGORIZATION")
print("-" * 40)
print(f"Numerical Features ({len(NUMERICAL_FEATURES)}): {NUMERICAL_FEATURES}")
print(f"Categorical Features ({len(CATEGORICAL_FEATURES)}): {CATEGORICAL_FEATURES}")
print(f"Target Variable: {TARGET_VARIABLE}")
🏷️ FEATURE CATEGORIZATION
----------------------------------------
Numerical Features (4): ['Product_Weight', 'Product_Allocated_Area', 'Product_MRP', 'Store_Establishment_Year']
Categorical Features (5): ['Product_Sugar_Content', 'Product_Type', 'Store_Size', 'Store_Location_City_Type', 'Store_Type']
Target Variable: Product_Store_Sales_Total
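These lists plug straight into a ColumnTransformer. A hedged sketch with a placeholder Ridge estimator and a tiny synthetic frame (the column subset is chosen for brevity; this is not the final pipeline):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import Ridge

numerical = ["Product_Weight", "Product_MRP"]   # subset of NUMERICAL_FEATURES
categorical = ["Store_Size", "Store_Type"]      # subset of CATEGORICAL_FEATURES

# Scale numeric columns, one-hot encode categoricals, then fit a regressor
pre = ColumnTransformer([
    ("num", StandardScaler(), numerical),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([("pre", pre), ("reg", Ridge())])

# Tiny synthetic frame just to show the pipeline runs end-to-end
X = pd.DataFrame({
    "Product_Weight": [12.7, 16.5, 14.3],
    "Product_MRP": [117.1, 171.4, 162.1],
    "Store_Size": ["Medium", "Medium", "High"],
    "Store_Type": ["Supermarket Type2", "Departmental Store", "Supermarket Type1"],
})
y = [2842.4, 4830.0, 4130.2]
model.fit(X, y)
print(model.predict(X).shape)  # one prediction per row
```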
class UnivariateAnalyzer:
"""
Comprehensive univariate analysis for EDA and preprocessing guidance
Methods:
- analyze_numerical_column: Complete numerical feature analysis
- analyze_categorical_column: Complete categorical feature analysis
"""
def __init__(self, figsize=(12, 6)):
"""
Initialize analyzer with plotting configuration
Args:
figsize: Figure size for plots (default: (12, 6))
"""
self.figsize = figsize
# plt.style.use('default')
def analyze_numerical_column(self, df, column_name):
"""
Comprehensive analysis for a single numerical column
Args:
df: pandas DataFrame
column_name: string, name of the numerical column
Returns:
dict: Analysis results for preprocessing guidance
"""
print(f"📊 NUMERICAL ANALYSIS: {column_name}")
print("=" * 50)
col_data = df[column_name].dropna()
# 1. Basic Statistics
print("📈 DESCRIPTIVE STATISTICS")
print("-" * 30)
desc_stats = col_data.describe()
print(desc_stats.round(3))
# 2. Distribution Analysis
print("\n📋 DISTRIBUTION ANALYSIS")
print("-" * 30)
# Skewness & Kurtosis
skewness = stats.skew(col_data)
kurtosis = stats.kurtosis(col_data)
print(f"Skewness: {skewness:.3f}")
if abs(skewness) < 0.5:
skew_interpretation = "Approximately Normal"
elif abs(skewness) < 1:
skew_interpretation = "Moderately Skewed"
else:
skew_interpretation = "Highly Skewed"
print(f"Interpretation: {skew_interpretation}")
print(f"Kurtosis: {kurtosis:.3f}")
if abs(kurtosis) < 3:
kurt_interpretation = "Normal-like tails"
else:
kurt_interpretation = "Heavy tails"
print(f"Interpretation: {kurt_interpretation}")
# 3. Outlier Detection
print("\n🔍 OUTLIER ANALYSIS")
print("-" * 30)
Q1 = col_data.quantile(0.25)
Q3 = col_data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = col_data[(col_data < lower_bound) | (col_data > upper_bound)]
outlier_percentage = (len(outliers) / len(col_data)) * 100
print("IQR Method:")
print(f" Lower Bound: {lower_bound:.3f}")
print(f" Upper Bound: {upper_bound:.3f}")
print(f" Outliers: {len(outliers)} ({outlier_percentage:.1f}%)")
# 4. Missing Values
missing_count = df[column_name].isnull().sum()
missing_percentage = (missing_count / len(df)) * 100
print("\n❓ MISSING VALUES")
print("-" * 30)
print(f"Count: {missing_count} ({missing_percentage:.1f}%)")
# 5. Preprocessing Recommendations
print("\n🔧 PREPROCESSING RECOMMENDATIONS")
print("-" * 30)
recommendations = []
# Scaling recommendations
if col_data.std() > 1000 or col_data.max() - col_data.min() > 1000:
recommendations.append("Consider StandardScaler or MinMaxScaler (large scale)")
# Transformation recommendations
if skewness > 1:
recommendations.append("Consider Log transformation (right skewed)")
elif skewness < -1:
recommendations.append("Consider Square transformation (left skewed)")
# Outlier handling
if outlier_percentage > 5:
recommendations.append("Consider RobustScaler (many outliers)")
elif outlier_percentage > 1:
recommendations.append("Consider outlier treatment (moderate outliers)")
# Missing value handling
if missing_percentage > 0:
if missing_percentage < 5:
recommendations.append("Consider mean/median imputation")
else:
recommendations.append("Investigate missing pattern, consider advanced imputation")
if not recommendations:
recommendations.append("Data appears clean, minimal preprocessing needed")
for i, rec in enumerate(recommendations, 1):
print(f"{i}. {rec}")
# 6. Visualization
fig, axes = plt.subplots(1, 2, figsize=self.figsize)
fig.suptitle(f'{column_name} - Distribution Analysis', fontsize=14, fontweight='bold')
# Histogram
axes[0].hist(col_data, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].axvline(col_data.mean(), color='red', linestyle='--', label=f'Mean: {col_data.mean():.2f}')
axes[0].axvline(col_data.median(), color='green', linestyle='--', label=f'Median: {col_data.median():.2f}')
axes[0].set_title('Distribution')
axes[0].set_xlabel(column_name)
axes[0].set_ylabel('Frequency')
axes[0].legend()
# Boxplot
axes[1].boxplot(col_data)
axes[1].set_title('Boxplot (Outlier Detection)')
axes[1].set_ylabel(column_name)
plt.tight_layout()
plt.show()
# Return analysis results
analysis_results = {
'skewness': skewness,
'kurtosis': kurtosis,
'outlier_percentage': outlier_percentage,
'missing_percentage': missing_percentage,
'recommendations': recommendations,
'statistics': desc_stats.to_dict()
}
return analysis_results
def analyze_categorical_column(self, df, column_name):
"""
Comprehensive analysis for a single categorical column
Args:
df: pandas DataFrame
column_name: string, name of the categorical column
Returns:
dict: Analysis results for preprocessing guidance
"""
print(f"📊 CATEGORICAL ANALYSIS: {column_name}")
print("=" * 50)
col_data = df[column_name]
# 1. Basic Statistics
print("📈 DESCRIPTIVE STATISTICS")
print("-" * 30)
unique_count = col_data.nunique()
total_count = len(col_data)
print(f"Total Values: {total_count:,}")
print(f"Unique Values: {unique_count}")
print(f"Cardinality: {'High' if unique_count > 10 else 'Medium' if unique_count > 2 else 'Low'}")
# 2. Value Distribution
print("\n📋 VALUE DISTRIBUTION")
print("-" * 30)
value_counts = col_data.value_counts()
value_percentages = col_data.value_counts(normalize=True) * 100
print("Top 10 Values:")
for i, (value, count) in enumerate(value_counts.head(10).items(), 1):
percentage = value_percentages[value]
print(f"{i:2d}. {value}: {count:,} ({percentage:.1f}%)")
# 3. Data Quality Check
print("\n🔍 DATA QUALITY CHECK")
print("-" * 30)
# Missing values
missing_count = col_data.isnull().sum()
missing_percentage = (missing_count / len(col_data)) * 100
print(f"Missing Values: {missing_count} ({missing_percentage:.1f}%)")
# Check for inconsistencies (case sensitivity, whitespace, etc.)
quality_issues = []
# Case sensitivity check
values_lower = [str(v).lower() for v in col_data.dropna().unique()]
if len(values_lower) != len(set(values_lower)):
quality_issues.append("Potential case sensitivity issues")
# Whitespace check
string_values = [str(v) for v in col_data.dropna().unique()]
stripped_values = [v.strip() for v in string_values]
if len(string_values) != len(set(stripped_values)):
quality_issues.append("Potential whitespace issues")
# Check for abbreviated vs full forms
if any(len(str(v)) <= 3 for v in col_data.dropna().unique()):
quality_issues.append("Mix of abbreviated and full forms detected")
if quality_issues:
print("Issues Found:")
for issue in quality_issues:
print(f" ⚠️ {issue}")
else:
print("✅ No major quality issues detected")
# 4. Class Imbalance Analysis
print("\n⚖️ CLASS BALANCE ANALYSIS")
print("-" * 30)
# Check for imbalance
min_percentage = value_percentages.min()
max_percentage = value_percentages.max()
imbalance_ratio = max_percentage / min_percentage
print(f"Most Common: {max_percentage:.1f}%")
print(f"Least Common: {min_percentage:.1f}%")
print(f"Imbalance Ratio: {imbalance_ratio:.1f}:1")
if imbalance_ratio > 10:
balance_status = "Highly Imbalanced"
elif imbalance_ratio > 5:
balance_status = "Moderately Imbalanced"
else:
balance_status = "Reasonably Balanced"
print(f"Status: {balance_status}")
# 5. Preprocessing Recommendations
print("\n🔧 PREPROCESSING RECOMMENDATIONS")
print("-" * 30)
recommendations = []
# Encoding recommendations
if unique_count == 2:
recommendations.append("Consider Label Encoding (binary categorical)")
elif unique_count <= 5:
recommendations.append("Consider One-Hot Encoding (low cardinality)")
else:
recommendations.append("Consider Target Encoding or Embedding (high cardinality)")
# Quality improvements
if quality_issues:
recommendations.append("Clean data inconsistencies before encoding")
# Handle missing values
if missing_percentage > 0:
if missing_percentage < 5:
recommendations.append("Consider mode imputation for missing values")
else:
recommendations.append("Investigate missing pattern, consider 'Unknown' category")
# Handle imbalance
if imbalance_ratio > 10:
recommendations.append("(Imbalance) Consider grouping rare categories or stratified sampling")
for i, rec in enumerate(recommendations, 1):
print(f"{i}. {rec}")
# 6. Visualization
plt.figure(figsize=self.figsize)
# Limit to top 15 categories for readability
top_categories = value_counts.head(15)
# Create bar plot
bars = plt.bar(range(len(top_categories)), top_categories.values,
color=plt.cm.Set3(np.linspace(0, 1, len(top_categories))))
plt.title(f'{column_name} - Value Distribution (Top 15)', fontsize=14, fontweight='bold')
plt.xlabel('Categories')
plt.ylabel('Count')
plt.xticks(range(len(top_categories)), top_categories.index, rotation=45, ha='right')
# Add value labels on bars
for bar, count in zip(bars, top_categories.values):
plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(top_categories.values)*0.01,
f'{count:,}', ha='center', va='bottom', fontsize=9)
plt.tight_layout()
plt.show()
# Return analysis results
analysis_results = {
'unique_count': unique_count,
'missing_percentage': missing_percentage,
'imbalance_ratio': imbalance_ratio,
'quality_issues': quality_issues,
'recommendations': recommendations,
'value_counts': value_counts.to_dict()
}
return analysis_results
analyzer = UnivariateAnalyzer()
classify_column(df['Product_Id'])
'ID or Unique Identifier'
Since this is nominal by nature, there is nothing to explore here
# Nature of column
classify_column(df['Product_Weight'])
'Continuous Numerical'
results = analyzer.analyze_numerical_column(df, 'Product_Weight')
📊 NUMERICAL ANALYSIS: Product_Weight
==================================================
📈 DESCRIPTIVE STATISTICS
------------------------------
count    8763.000
mean       12.654
std         2.217
min         4.000
25%        11.150
50%        12.660
75%        14.180
max        22.000
Name: Product_Weight, dtype: float64

📋 DISTRIBUTION ANALYSIS
------------------------------
Skewness: 0.018
Interpretation: Approximately Normal
Kurtosis: 0.004
Interpretation: Normal-like tails

🔍 OUTLIER ANALYSIS
------------------------------
IQR Method:
  Lower Bound: 6.605
  Upper Bound: 18.725
  Outliers: 54 (0.6%)

❓ MISSING VALUES
------------------------------
Count: 0 (0.0%)

🔧 PREPROCESSING RECOMMENDATIONS
------------------------------
1. Data appears clean, minimal preprocessing needed
pp.pprint(results)
{'kurtosis': 0.003881980226869697,
'missing_percentage': 0.0,
'outlier_percentage': 0.6162273194111606,
'recommendations': ['Data appears clean, minimal preprocessing needed'],
'skewness': 0.01751117204148118,
'statistics': {'25%': 11.15,
'50%': 12.66,
'75%': 14.18,
'count': 8763.0,
'max': 22.0,
'mean': 12.653792080337787,
'min': 4.0,
'std': 2.2173201376193634}}
🔍 Observations:
• Nearly perfect distribution: Skewness (0.02) and mean ≈ median indicate well-balanced, normal-like data distribution
• High data quality: Zero missing values and minimal outliers (0.6%) suggest clean, reliable weight measurements across products
• Preprocessing advantage: Excellent statistical properties mean minimal transformation needed
📌 NOTE: Scaling decision depends on final model choice
So what should we do with Product_Weight?
No transformation → it’s already symmetric and clean
✅ But scale it (e.g., StandardScaler) along with other numeric features during preprocessing
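A minimal sketch of that choice on synthetic weight-like values (StandardScaler standardizes to zero mean and unit variance):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic values roughly spanning the Product_Weight range seen above
weights = np.array([[4.0], [11.15], [12.66], [14.18], [22.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(weights)

# After scaling: mean ~0, standard deviation ~1
print(scaled.mean().round(6), scaled.std().round(6))
```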
# Nature of column
classify_column(df['Product_Sugar_Content'])
'Categorical'
results = analyzer.analyze_categorical_column(df, 'Product_Sugar_Content')
📊 CATEGORICAL ANALYSIS: Product_Sugar_Content
==================================================
📈 DESCRIPTIVE STATISTICS
------------------------------
Total Values: 8,763
Unique Values: 4
Cardinality: Medium

📋 VALUE DISTRIBUTION
------------------------------
Top 10 Values:
 1. Low Sugar: 4,885 (55.7%)
 2. Regular: 2,251 (25.7%)
 3. No Sugar: 1,519 (17.3%)
 4. reg: 108 (1.2%)

🔍 DATA QUALITY CHECK
------------------------------
Missing Values: 0 (0.0%)
Issues Found:
  ⚠️ Mix of abbreviated and full forms detected

⚖️ CLASS BALANCE ANALYSIS
------------------------------
Most Common: 55.7%
Least Common: 1.2%
Imbalance Ratio: 45.2:1
Status: Highly Imbalanced

🔧 PREPROCESSING RECOMMENDATIONS
------------------------------
1. Consider One-Hot Encoding (low cardinality)
2. Clean data inconsistencies before encoding
3. (Imbalance) Consider grouping rare categories or stratified sampling
pp.pprint(results)
{'imbalance_ratio': 45.23148148148148,
'missing_percentage': 0.0,
'quality_issues': ['Mix of abbreviated and full forms detected'],
'recommendations': ['Consider One-Hot Encoding (low cardinality)',
'Clean data inconsistencies before encoding',
'(Imbalance) Consider grouping rare categories or '
'stratified sampling'],
'unique_count': 4,
'value_counts': {'Low Sugar': 4885,
'No Sugar': 1519,
'Regular': 2251,
'reg': 108}}
🔍 Observation: Product_Sugar_Content
Inconsistent labels ('Regular' vs 'reg') → needs cleaning before use.
Solution Strategy: instead of dropping these rows, standardize the abbreviated form to its full label (reg → Regular).
(Data-dictionary note: Product_Allocated_Area, analyzed next, is the share of display space given to a product among all products in the same store.)
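That cleanup is a one-liner with pandas `replace`; a sketch on a toy Series (the mapping reg → Regular is the one identified above):

```python
import pandas as pd

# The four observed labels, including the abbreviated 'reg'
s = pd.Series(["Low Sugar", "Regular", "No Sugar", "reg", "reg"])

# Map the abbreviation onto its full form before encoding
s_clean = s.replace({"reg": "Regular"})
print(s_clean.value_counts().to_dict())  # 'reg' merged into 'Regular'
```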
# Nature of column
classify_column(df['Product_Allocated_Area'])
'Continuous Numerical'
df['Product_Allocated_Area'].head()
0    0.027
1    0.144
2    0.031
3    0.112
4    0.010
Name: Product_Allocated_Area, dtype: float64
📌 NOTE: It's a ratio, hence the value will always lie between 0 and 1
results = analyzer.analyze_numerical_column(df, 'Product_Allocated_Area')
📊 NUMERICAL ANALYSIS: Product_Allocated_Area
==================================================
📈 DESCRIPTIVE STATISTICS
------------------------------
count    8763.000
mean        0.069
std         0.048
min         0.004
25%         0.031
50%         0.056
75%         0.096
max         0.298
Name: Product_Allocated_Area, dtype: float64

📋 DISTRIBUTION ANALYSIS
------------------------------
Skewness: 1.128
Interpretation: Highly Skewed
Kurtosis: 1.292
Interpretation: Normal-like tails

🔍 OUTLIER ANALYSIS
------------------------------
IQR Method:
  Lower Bound: -0.067
  Upper Bound: 0.194
  Outliers: 104 (1.2%)

❓ MISSING VALUES
------------------------------
Count: 0 (0.0%)

🔧 PREPROCESSING RECOMMENDATIONS
------------------------------
1. Consider Log transformation (right skewed)
2. Consider outlier treatment (moderate outliers)
⚡ Insights
pp.pprint(results)
{'kurtosis': 1.2917138525421574,
'missing_percentage': 0.0,
'outlier_percentage': 1.1868081707177907,
'recommendations': ['Consider Log transformation (right skewed)',
'Consider outlier treatment (moderate outliers)'],
'skewness': 1.1279000327813162,
'statistics': {'25%': 0.031,
'50%': 0.056,
'75%': 0.096,
'count': 8763.0,
'max': 0.298,
'mean': 0.0687863745292708,
'min': 0.004,
'std': 0.04820376849944802}}
🔍 Observations
⚠️ Preprocessing needed: right skewness and moderate outliers indicate a log/power transformation is likely to improve model performance
Given its likely link to product visibility and sales, this feature is expected to be an important predictor.
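A quick sketch of the recommended transformation using PowerTransformer (imported above); the beta-distributed sample is synthetic, chosen only to mimic a right-skewed ratio in (0, 0.3):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Synthetic right-skewed ratios, loosely mimicking Product_Allocated_Area
area = rng.beta(2, 25, size=5000).reshape(-1, 1)

pt = PowerTransformer(method="yeo-johnson")  # handles zeros; also standardizes output
area_t = pt.fit_transform(area)

# Skewness should shrink toward 0 after the transform
print(f"skew before: {skew(area.ravel()):.2f}, after: {skew(area_t.ravel()):.2f}")
```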
# Nature of column
classify_column(df['Product_Type'])
'Categorical'
results = analyzer.analyze_categorical_column(df, 'Product_Type')
📊 CATEGORICAL ANALYSIS: Product_Type
==================================================
📈 DESCRIPTIVE STATISTICS
------------------------------
Total Values: 8,763
Unique Values: 16
Cardinality: High

📋 VALUE DISTRIBUTION
------------------------------
Top 10 Values:
 1. Fruits and Vegetables: 1,249 (14.3%)
 2. Snack Foods: 1,149 (13.1%)
 3. Frozen Foods: 811 (9.3%)
 4. Dairy: 796 (9.1%)
 5. Household: 740 (8.4%)
 6. Baking Goods: 716 (8.2%)
 7. Canned: 677 (7.7%)
 8. Health and Hygiene: 628 (7.2%)
 9. Meat: 618 (7.1%)
10. Soft Drinks: 519 (5.9%)

🔍 DATA QUALITY CHECK
------------------------------
Missing Values: 0 (0.0%)
✅ No major quality issues detected

⚖️ CLASS BALANCE ANALYSIS
------------------------------
Most Common: 14.3%
Least Common: 0.9%
Imbalance Ratio: 16.4:1
Status: Highly Imbalanced

🔧 PREPROCESSING RECOMMENDATIONS
------------------------------
1. Consider Target Encoding or Embedding (high cardinality)
2. (Imbalance) Consider grouping rare categories or stratified sampling
table_it(results['value_counts'], show_index=True)
| # | Key | Value |
|---|---|---|
| 1 | Fruits and Vegetables | 1249 |
| 2 | Snack Foods | 1149 |
| 3 | Frozen Foods | 811 |
| 4 | Dairy | 796 |
| 5 | Household | 740 |
| 6 | Baking Goods | 716 |
| 7 | Canned | 677 |
| 8 | Health and Hygiene | 628 |
| 9 | Meat | 618 |
| 10 | Soft Drinks | 519 |
| 11 | Breads | 200 |
| 12 | Hard Drinks | 186 |
| 13 | Others | 151 |
| 14 | Starchy Foods | 141 |
| 15 | Breakfast | 106 |
| 16 | Seafood | 76 |
results
{'unique_count': 16,
'missing_percentage': 0.0,
'imbalance_ratio': 16.43421052631579,
'quality_issues': [],
'recommendations': ['Consider Target Encoding or Embedding (high cardinality)',
'(Imbalance) Consider grouping rare categories or stratified sampling'],
'value_counts': {'Fruits and Vegetables': 1249,
'Snack Foods': 1149,
'Frozen Foods': 811,
'Dairy': 796,
'Household': 740,
'Baking Goods': 716,
'Canned': 677,
'Health and Hygiene': 628,
'Meat': 618,
'Soft Drinks': 519,
'Breads': 200,
'Hard Drinks': 186,
'Others': 151,
'Starchy Foods': 141,
'Breakfast': 106,
'Seafood': 76}}
🔍 Observations Product_Type:
• Diverse product portfolio: 16 categories ranging from dominant fresh produce (1,249 items) to niche seafood (76 items) - reflects SuperKart's full supermarket range
• Clear category hierarchy: Fresh items (Fruits/Vegetables, Snacks) dominate inventory, while specialty items (Seafood, Breakfast) form smaller segments
• High cardinality challenge: 16 categories with 16:1 imbalance ratio requires Target Encoding rather than One-Hot to avoid sparse features
• Business opportunity: Rare categories like Seafood might benefit from strategic expansion or could be grouped with similar categories for modeling
💡 Group Rare Categories (Valid Option): combine rare product types into an "Other" category, especially before One-Hot Encoding.
🧠 NOTE: One-Hot Encoding Problem:
Creates 16 binary columns:
Issues:
Target Encoding Solution:
Creates 1 numerical column based on average sales:
Benefits:
Analogy:
One-Hot = Giving each person a separate yes/no checkbox
Target = Giving each person a single "performance score"
Target Encoding captures the relationship while staying compact!
Rule of Thumb
Linear Models: OHE preferred (avoids false ordinal)
Tree Models: Either works (trees find splits, don't assume order)
So for Product_Type: OHE after grouping rare categories is the safer approach.
Linear models are associated with OHE because of how they handle coefficients.
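The "group rare categories, then one-hot encode" approach can be sketched on toy data (the frequency threshold here is illustrative, not a recommendation for the real dataset):

```python
import pandas as pd

# Toy Product_Type column with one rare category
s = pd.Series(['Dairy', 'Dairy', 'Snack Foods', 'Seafood', 'Dairy', 'Snack Foods'])

# Group categories below an illustrative frequency threshold into 'Other'
counts = s.value_counts()
rare = counts[counts < 2].index
grouped = s.where(~s.isin(rare), 'Other')

# One-hot encode the grouped column
ohe = pd.get_dummies(grouped, prefix='Product_Type')
print(ohe.columns.tolist())
```

For the actual data, a threshold of around 1-2% of rows would fold the smallest categories (e.g. Seafood, Breakfast) into 'Other' before encoding.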
# nature of column
classify_column(df['Product_MRP'])
'Continuous Numerical'
results = analyzer.analyze_numerical_column(df, 'Product_MRP')
📊 NUMERICAL ANALYSIS: Product_MRP ================================================== 📈 DESCRIPTIVE STATISTICS ------------------------------ count 8763.000 mean 147.033 std 30.694 min 31.000 25% 126.160 50% 146.740 75% 167.585 max 266.000 Name: Product_MRP, dtype: float64 📋 DISTRIBUTION ANALYSIS ------------------------------ Skewness: 0.037 Interpretation: Approximately Normal Kurtosis: -0.012 Interpretation: Normal-like tails 🔍 OUTLIER ANALYSIS ------------------------------ IQR Method: Lower Bound: 64.022 Upper Bound: 229.723 Outliers: 57 (0.7%) ❓ MISSING VALUES ------------------------------ Count: 0 (0.0%) 🔧 PREPROCESSING RECOMMENDATIONS ------------------------------ 1. Data appears clean, minimal preprocessing needed
pp.pprint(results)
{'kurtosis': -0.01226820030102882,
'missing_percentage': 0.0,
'outlier_percentage': 0.6504621704895585,
'recommendations': ['Data appears clean, minimal preprocessing needed'],
'skewness': 0.03650675645590474,
'statistics': {'25%': 126.16,
'50%': 146.74,
'75%': 167.585,
'count': 8763.0,
'max': 266.0,
'mean': 147.03253908478834,
'min': 31.0,
'std': 30.69411031174993}}
⚡ Insights
The distribution appears approximately normal: low skewness and normal-like tails indicate a good spread across price ranges.
Excellent data quality: No missing values and minimal outliers (0.65%) suggest consistent, reliable pricing data collection
Business insight: Price range (₹31-₹266) with ₹147 average reflects diverse product portfolio from budget to premium segments
Preprocessing advantage: Clean statistical properties require minimal transformation - can use raw MRP values directly in modeling
💡 NOTE: ✅ Even clean numeric features like this should still be scaled (e.g., with StandardScaler) when using linear models.
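A minimal sketch of that scaling step, using a few illustrative MRP values rather than the real column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative MRP values (roughly matching the observed min/quartiles/max)
mrp = np.array([[31.0], [126.16], [146.74], [167.585], [266.0]])

# Re-center to mean 0 and unit variance; the distribution's shape is unchanged
mrp_scaled = StandardScaler().fit_transform(mrp)
print(mrp_scaled.ravel())
```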
# nature of column
classify_column(df['Store_Id'])
'Categorical'
analyzer.analyze_categorical_column(df, 'Store_Id')
📊 CATEGORICAL ANALYSIS: Store_Id ================================================== 📈 DESCRIPTIVE STATISTICS ------------------------------ Total Values: 8,763 Unique Values: 4 Cardinality: Medium 📋 VALUE DISTRIBUTION ------------------------------ Top 10 Values: 1. OUT004: 4,676 (53.4%) 2. OUT001: 1,586 (18.1%) 3. OUT003: 1,349 (15.4%) 4. OUT002: 1,152 (13.1%) 🔍 DATA QUALITY CHECK ------------------------------ Missing Values: 0 (0.0%) ✅ No major quality issues detected ⚖️ CLASS BALANCE ANALYSIS ------------------------------ Most Common: 53.4% Least Common: 13.1% Imbalance Ratio: 4.1:1 Status: Reasonably Balanced 🔧 PREPROCESSING RECOMMENDATIONS ------------------------------ 1. Consider One-Hot Encoding (low cardinality)
{'unique_count': 4,
'missing_percentage': 0.0,
'imbalance_ratio': 4.059027777777777,
'quality_issues': [],
'recommendations': ['Consider One-Hot Encoding (low cardinality)'],
'value_counts': {'OUT004': 4676,
'OUT001': 1586,
'OUT003': 1349,
'OUT002': 1152}}
⚡ Insights
Store_Id is a nominal identifier; it doesn't carry inherent predictive value.
Including it would let the model memorize store-specific patterns, which hurts generalization to unseen data or stores.
In real-world deployment, we can't rely on store-specific IDs to predict; we want the model to learn from store attributes like type, size, and location.
🔍 Observations
Uneven store representation: OUT004 dominates dataset (53% of records) while OUT002 has least data (13%) - reflects different store sizes or product ranges
Moderate imbalance: 4:1 ratio suggests some stores contribute significantly more product-store combinations than others.
EDA insight: Store representation differences may reflect business reality (flagship vs smaller outlets) rather than data collection bias
✅ Drop Store_Id - it adds noise and risks overfitting
# nature of column
classify_column(df['Store_Establishment_Year'])
'Ordinal or Discrete (Numerical)'
results = analyzer.analyze_categorical_column(df, 'Store_Establishment_Year')
📊 CATEGORICAL ANALYSIS: Store_Establishment_Year ================================================== 📈 DESCRIPTIVE STATISTICS ------------------------------ Total Values: 8,763 Unique Values: 4 Cardinality: Medium 📋 VALUE DISTRIBUTION ------------------------------ Top 10 Values: 1. 2009: 4,676 (53.4%) 2. 1987: 1,586 (18.1%) 3. 1999: 1,349 (15.4%) 4. 1998: 1,152 (13.1%) 🔍 DATA QUALITY CHECK ------------------------------ Missing Values: 0 (0.0%) ✅ No major quality issues detected ⚖️ CLASS BALANCE ANALYSIS ------------------------------ Most Common: 53.4% Least Common: 13.1% Imbalance Ratio: 4.1:1 Status: Reasonably Balanced 🔧 PREPROCESSING RECOMMENDATIONS ------------------------------ 1. Consider One-Hot Encoding (low cardinality)
⚡ Insights
If treated as a numeric column:
📌 Risk: it introduces a false sense of continuity or ordering if the relationship with sales is not actually linear.
🧠 Domain KnowHow
Business Context Consideration:
What really matters for sales?
Recommended Approach (Feature Engineering):
A Store_Age transformation captures the business logic better than the raw establishment year.
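The transformation itself is a one-liner; note that the reference year below is an assumption (the dataset's snapshot year isn't stated), so it should be set to whatever point in time the data represents.

```python
import pandas as pd

# Assumed snapshot year for the data; the true reference year isn't given in the dataset
REFERENCE_YEAR = 2010

years = pd.Series([2009, 1987, 1999, 1998], name='Store_Establishment_Year')
store_age = REFERENCE_YEAR - years  # age of each store at the reference point

print(store_age.tolist())  # [1, 23, 11, 12]
```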
# nature of column
classify_column(df['Store_Size'])
'Categorical'
results = analyzer.analyze_categorical_column(df, 'Store_Size')
📊 CATEGORICAL ANALYSIS: Store_Size ================================================== 📈 DESCRIPTIVE STATISTICS ------------------------------ Total Values: 8,763 Unique Values: 3 Cardinality: Medium 📋 VALUE DISTRIBUTION ------------------------------ Top 10 Values: 1. Medium: 6,025 (68.8%) 2. High: 1,586 (18.1%) 3. Small: 1,152 (13.1%) 🔍 DATA QUALITY CHECK ------------------------------ Missing Values: 0 (0.0%) ✅ No major quality issues detected ⚖️ CLASS BALANCE ANALYSIS ------------------------------ Most Common: 68.8% Least Common: 13.1% Imbalance Ratio: 5.2:1 Status: Moderately Imbalanced 🔧 PREPROCESSING RECOMMENDATIONS ------------------------------ 1. Consider One-Hot Encoding (low cardinality)
pp.pprint(results)
{'imbalance_ratio': 5.230034722222221,
'missing_percentage': 0.0,
'quality_issues': [],
'recommendations': ['Consider One-Hot Encoding (low cardinality)'],
'unique_count': 3,
'value_counts': {'High': 1586, 'Medium': 6025, 'Small': 1152}}
⚡ Insights
Medium stores dominate: 69% of product-store combinations come from medium-sized stores, suggesting SuperKart's primary business model focuses on mid-tier outlets
Balanced representation: All three size categories have substantial data (1,000+ records each) ensuring reliable statistical analysis for each store type.
Natural ordering present: Small < Medium < High represents clear business hierarchy that should be preserved in modeling
Prefer Ordinal Encoding for this column!
# nature of column
classify_column(df['Store_Location_City_Type'])
'Categorical'
results = analyzer.analyze_categorical_column(df, 'Store_Location_City_Type')
📊 CATEGORICAL ANALYSIS: Store_Location_City_Type ================================================== 📈 DESCRIPTIVE STATISTICS ------------------------------ Total Values: 8,763 Unique Values: 3 Cardinality: Medium 📋 VALUE DISTRIBUTION ------------------------------ Top 10 Values: 1. Tier 2: 6,262 (71.5%) 2. Tier 1: 1,349 (15.4%) 3. Tier 3: 1,152 (13.1%) 🔍 DATA QUALITY CHECK ------------------------------ Missing Values: 0 (0.0%) ✅ No major quality issues detected ⚖️ CLASS BALANCE ANALYSIS ------------------------------ Most Common: 71.5% Least Common: 13.1% Imbalance Ratio: 5.4:1 Status: Moderately Imbalanced 🔧 PREPROCESSING RECOMMENDATIONS ------------------------------ 1. Consider One-Hot Encoding (low cardinality)
results
{'unique_count': 3,
'missing_percentage': 0.0,
'imbalance_ratio': 5.4357638888888875,
'quality_issues': [],
'recommendations': ['Consider One-Hot Encoding (low cardinality)'],
'value_counts': {'Tier 2': 6262, 'Tier 1': 1349, 'Tier 3': 1152}}
⚡ Insights
Tier 2 city focus: 71% of SuperKart's operations concentrate in Tier 2 cities, indicating strategic positioning in mid-tier urban markets
Balanced tier representation: Both Tier 1 and Tier 3 cities have similar presence (15% and 13% respectively), suggesting comprehensive market coverage
Natural city hierarchy: Tier 1 > Tier 2 > Tier 3 represents clear economic progression that should be preserved in modeling
Strategic insight: Heavy Tier 2 presence suggests SuperKart targets growing urban centers rather than just premium or budget markets
Prefer Ordinal Encoding for this column!
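A sketch of ordinal encoding for both ordered columns, with the category order stated explicitly (the Tier 3 < Tier 2 < Tier 1 ordering is an assumption based on the economic hierarchy noted above):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_toy = pd.DataFrame({
    'Store_Size': ['Small', 'Medium', 'High'],
    'Store_Location_City_Type': ['Tier 3', 'Tier 2', 'Tier 1'],
})

# Spell out the order so the encoder preserves the business hierarchy,
# instead of the default alphabetical ordering
encoder = OrdinalEncoder(categories=[
    ['Small', 'Medium', 'High'],     # ascending store size
    ['Tier 3', 'Tier 2', 'Tier 1'],  # ascending economic tier (assumption)
])
encoded = encoder.fit_transform(df_toy)
print(encoded)
```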
# nature
classify_column(df['Store_Type'])
'Categorical'
results = analyzer.analyze_categorical_column(df, 'Store_Type')
📊 CATEGORICAL ANALYSIS: Store_Type ================================================== 📈 DESCRIPTIVE STATISTICS ------------------------------ Total Values: 8,763 Unique Values: 4 Cardinality: Medium 📋 VALUE DISTRIBUTION ------------------------------ Top 10 Values: 1. Supermarket Type2: 4,676 (53.4%) 2. Supermarket Type1: 1,586 (18.1%) 3. Departmental Store: 1,349 (15.4%) 4. Food Mart: 1,152 (13.1%) 🔍 DATA QUALITY CHECK ------------------------------ Missing Values: 0 (0.0%) ✅ No major quality issues detected ⚖️ CLASS BALANCE ANALYSIS ------------------------------ Most Common: 53.4% Least Common: 13.1% Imbalance Ratio: 4.1:1 Status: Reasonably Balanced 🔧 PREPROCESSING RECOMMENDATIONS ------------------------------ 1. Consider One-Hot Encoding (low cardinality)
pp.pprint(results)
{'imbalance_ratio': 4.059027777777777,
'missing_percentage': 0.0,
'quality_issues': [],
'recommendations': ['Consider One-Hot Encoding (low cardinality)'],
'unique_count': 4,
'value_counts': {'Departmental Store': 1349,
'Food Mart': 1152,
'Supermarket Type1': 1586,
'Supermarket Type2': 4676}}
⚡ Insights
This is a categorical (nominal) variable with 4 unique types.
Supermarket Type2 dominance: 53% of operations are Type2 format, indicating this is SuperKart's primary business model across locations
Other formats still have 1,000+ samples each, ensuring reliable analysis.
One-Hot Encoding is suitable here
class EDAHelper:
"""
Helper class for common EDA visualizations
"""
@staticmethod
def plot_scatter_with_trend(df, x_col_name, y_col_name, x_label, y_label, title=None, color='skyblue', show_it=False):
"""
Create scatter plot with trend line
Args:
df: DataFrame
x_col_name: x-axis column name
y_col_name: y-axis column name
x_label: x-axis label
y_label: y-axis label
title: plot title (optional)
color: scatter plot color
show_it: whether to show the plot
Returns:
correlation coefficient
"""
x_col = df[x_col_name]
y_col = df[y_col_name]
# Scatter plot
plt.scatter(x_col, y_col, alpha=0.6, color=color)
# Trend line
z = np.polyfit(x_col, y_col, 1)
p = np.poly1d(z)
plt.plot(x_col, p(x_col), "r--", alpha=0.8)
# Labels and formatting
plt.xlabel(x_label)
plt.ylabel(y_label)
plt.title(title or f'{x_label} vs {y_label}')
plt.grid(True, alpha=0.3)
# Correlation
correlation = x_col.corr(y_col)
plt.text(0.05, 0.95, f'Correlation: {correlation:.3f}',
transform=plt.gca().transAxes, bbox=dict(boxstyle="round", facecolor='white'))
if show_it:
plt.tight_layout()
plt.show()
return correlation
@staticmethod
def plot_binned_analysis(df, x_col_name, y_col_name, x_label, y_label, bins=5, labels=None, title=None, color='lightcoral', show_it=False):
"""
Create binned analysis bar plot
Args:
df: DataFrame
x_col_name: x-axis column name
y_col_name: y-axis column name
x_label: x-axis label
y_label: y-axis label
bins: number of bins or custom bin edges
labels: custom labels for bins
title: plot title
color: bar color
show_it: whether to show the plot
Returns:
binned summary series
"""
x_col = df[x_col_name]
# Create bins
if labels is None:
n_bins = bins if isinstance(bins, int) else len(bins) - 1  # bins may be an int or an array of edges
labels = [f'Bin_{i+1}' for i in range(n_bins)]
feat_bins = pd.cut(x_col, bins=bins, labels=labels)
binned_summary = df.groupby(feat_bins)[y_col_name].mean()
# Bar plot
bars = plt.bar(binned_summary.index, binned_summary.values, color=color)
plt.xlabel(x_label + ' Categories')
plt.ylabel(f'Average {y_label}')
plt.title(title or f'Average {y_label} by {x_label} Bins')
plt.xticks(rotation=45)
# Add value labels on bars
for bar, value in zip(bars, binned_summary.values):
plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(binned_summary.values)*0.01,
f'{value:.0f}', ha='center', va='bottom')
if show_it:
plt.tight_layout()
plt.show()
return binned_summary
eda_helper = EDAHelper()
Target Relationship Analysis
🔭 Possible outcome
# Product_MRP vs Sales
def analyze_mrp_vs_sales(df):
"""
Analyze the relationship between Product MRP and Total Sales.
Plots:
1. Scatter plot with trend line.
2. Bar plot of average sales per MRP bin.
Returns:
float: Correlation coefficient.
pd.Series: Average sales per MRP bin.
"""
# plt.figure(figsize=(12, 6))
mrp_col = "Product_MRP"
sales_col = "Product_Store_Sales_Total"
x_label = "MRP"
y_label = "Sales"
# Scatter plot with trend line
plt.subplot(1, 2, 1)
correlation = eda_helper.plot_scatter_with_trend(
df, mrp_col, sales_col, x_label, y_label
)
# Binned analysis
plt.subplot(1, 2, 2)
binned_summary = eda_helper.plot_binned_analysis(
df,
mrp_col,
sales_col,
x_label,
y_label,
bins=5,
labels=["Very Low", "Low", "Medium", "High", "Very High"],
)
plt.tight_layout()
plt.show()
return correlation, binned_summary
# Run analysis
correlation, mrp_summary = analyze_mrp_vs_sales(df)
print(f"Correlation coefficient: {correlation:.3f}")
print("\nAverage Sales by MRP Range:")
print(mrp_summary.round(0))
Correlation coefficient: 0.788 Average Sales by MRP Range: Product_MRP Very Low 1272.0 Low 2445.0 Medium 3493.0 High 4570.0 Very High 5808.0 Name: Product_Store_Sales_Total, dtype: float64
⚡ Insights
Strong positive relationship: Correlation of 0.788 indicates higher-priced products consistently generate more sales revenue
Clear pricing hierarchy: Sales nearly quintuple from very low (₹1,272) to very high (₹5,808) MRP ranges, suggesting effective premium pricing strategy
Linear trend confirmed: Consistent step-up across all price categories validates MRP as a strong predictor for sales forecasting
Business insight: SuperKart's pricing strategy aligns well with market demand - premium products deliver proportionally higher returns
🔭 Possible outcome
# Product_Allocated_Area vs Sales
def analyze_area_vs_sales(df):
"""
Analyze relationship between Product Allocated Area and Sales
"""
x_col_name = "Product_Allocated_Area"
y_col_name = "Product_Store_Sales_Total"
x_label = "Allocated Area"
y_label = "Sales"
# Correlation coefficient
plt.subplot(1, 2, 1)
correlation = eda_helper.plot_scatter_with_trend(
df, x_col_name, y_col_name, x_label, y_label
)
# Binned analysis
plt.subplot(1, 2, 2)
binned_summary = eda_helper.plot_binned_analysis(
df,
x_col_name,
y_col_name,
x_label,
y_label,
bins=5,
labels=["Very Small", "Small", "Medium", "Large", "Very Large"],
)
plt.tight_layout()
plt.show()
return correlation, binned_summary
# Run analysis
correlation, area_summary = analyze_area_vs_sales(df)
print(f"Correlation coefficient: {correlation:.3f}")
print("\nAverage Sales by Allocated Area Range:")
print(area_summary.round(0))
Correlation coefficient: -0.001 Average Sales by Allocated Area Range: Product_Allocated_Area Very Small 3458.0 Small 3498.0 Medium 3412.0 Large 3454.0 Very Large 3475.0 Name: Product_Store_Sales_Total, dtype: float64
⚡ Insights
No relationship found: Near-zero correlation (-0.001) indicates shelf space allocation doesn't drive sales performance
All allocation categories show similar sales (₹3,412-₹3,498), suggesting current allocation strategy is already optimized
Contradicts retail assumptions: Unlike typical retail wisdom, more display area doesn't translate to higher sales at SuperKart
Product_Allocated_Area may be less predictive than expected
🔭 Possible outcome
sns.boxplot(data=df, x='Product_Type', y='Product_Store_Sales_Total')
plt.xticks(rotation=45)
plt.title('Sales Distribution by Product Type')
plt.xlabel('Product Type')
plt.ylabel('Sales Total (₹)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
avg_sales = df.groupby('Product_Type')['Product_Store_Sales_Total'].mean().sort_values(ascending=False)
sns.barplot(x=avg_sales.index, y=avg_sales.values)
plt.xticks(rotation=45)
plt.title('Average Sales by Product Type')
plt.xlabel('Product Type')
plt.ylabel('Average Sales (₹)')
# Add value labels
for index, value in enumerate(avg_sales.values):
plt.text(index, value + 50, f'{value:.0f}', ha='center', va='bottom')
plt.grid(True, axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
avg_sales
Product_Type Starchy Foods 3679.251418 Others 3586.068212 Seafood 3584.263684 Breads 3574.711200 Dairy 3532.560352 Snack Foods 3471.711880 Household 3465.865095 Frozen Foods 3464.834562 Soft Drinks 3462.513911 Health and Hygiene 3445.393646 Meat 3445.326764 Fruits and Vegetables 3443.421353 Baking Goods 3425.958101 Breakfast 3416.324623 Canned 3397.463383 Hard Drinks 3364.594731 Name: Product_Store_Sales_Total, dtype: float64
⚡ Insights
Narrow performance range: All categories perform within ₹315 range (₹3,365-₹3,679), indicating product type has modest impact on sales compared to MRP
Staple foods lead: Top performers (Starchy Foods, Others, Seafood, Breads, Dairy) suggest essential/cooking ingredients drive higher sales
Beverages underperform: Hard Drinks rank lowest while Soft Drinks are mid-tier, indicating different consumption patterns across drink categories
Grouping validation: 🤔 Small performance differences between many categories statistically support combining rare categories during preprocessing
🔭 Possible outcome
# Box Plot
sns.boxplot(data=df, x='Store_Size', y='Product_Store_Sales_Total')
plt.title('Sales Distribution across Store Sizes')
plt.xlabel('Store Size')
plt.ylabel('Sales Total (₹)')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
avg_sales = df.groupby('Store_Size')['Product_Store_Sales_Total'].mean().sort_values(ascending=False)
sns.barplot(x=avg_sales.index, y=avg_sales.values)
plt.xticks(rotation=45)
plt.title('Average Sales by Store Size')
plt.xlabel('Store Size')
plt.ylabel('Average Sales (₹)')
# Add value labels
for index, value in enumerate(avg_sales.values):
plt.text(index, value + 50, f'{value:.0f}', ha='center', va='bottom')
plt.grid(True, axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
avg_sales
Store_Size High 3923.778802 Medium 3668.222573 Small 1762.942465 Name: Product_Store_Sales_Total, dtype: float64
⚡ Insights
Strong size advantage: High stores generate 2.2x more sales than Small stores (₹3,924 vs ₹1,763), confirming clear economies of scale
Small store challenge: Dramatic performance gap between Small (₹1,763) and Medium (₹3,668) suggests different business models rather than gradual scaling
Business insight: Store size is a strong predictor - expansion to larger formats could significantly boost SuperKart's revenue per location
🔭 Possible outcome
# Box Plot
sns.boxplot(data=df, x='Store_Location_City_Type', y='Product_Store_Sales_Total')
plt.title('Sales Distribution across Store Locations')
plt.xlabel('Store Location')
plt.ylabel('Sales Total (₹)')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
# Order by logical tier progression
tier_order = ['Tier 1', 'Tier 2', 'Tier 3']
tier_sales = df.groupby('Store_Location_City_Type')['Product_Store_Sales_Total'].mean().reindex(tier_order)
sns.barplot(x=tier_sales.index, y=tier_sales.values)
plt.xticks(rotation=45)
plt.title('Average Sales by Store Location')
plt.xlabel('Store Location')
plt.ylabel('Average Sales (₹)')
# Add value labels
for index, value in enumerate(tier_sales.values):
plt.text(index, value + 50, f'{value:.0f}', ha='center', va='bottom')
plt.grid(True, axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
tier_sales
Store_Location_City_Type Tier 1 4946.966323 Tier 2 3457.473109 Tier 3 1762.942465 Name: Product_Store_Sales_Total, dtype: float64
# Get tier cities frequencies in pct
tier_cities_freq = df['Store_Location_City_Type'].value_counts(normalize=True)
tier_cities_freq
Store_Location_City_Type Tier 2 0.714595 Tier 1 0.153943 Tier 3 0.131462 Name: proportion, dtype: float64
⚡ Insights
Clear tier hierarchy: Tier 1 cities deliver 43% higher sales than Tier 2 (₹4,947 vs ₹3,457), confirming economic tier advantages
Quality over quantity: 🧐 Despite Tier 2 cities having 71% of SuperKart's locations, Tier 1 stores significantly outperform - showing fewer premium locations beat many average ones.
Significant market gaps: 2.8x performance difference between Tier 1 and Tier 3 cities indicates distinct market dynamics rather than gradual scaling
Strategic insight: Tier 1 focus could maximize revenue, while Tier 3 performance matches Small stores suggesting similar constraints
Multivariate Analysis
# Pivot table: Average sales for each (Product_Type, Store_Size) combination
pivot = df.pivot_table(index='Product_Type',
columns='Store_Size',
values='Product_Store_Sales_Total',
aggfunc='mean')
sns.heatmap(pivot, annot=True, fmt=".0f", cmap="YlGnBu")
plt.title("Average Sales by Product Type and Store Size")
plt.xlabel("Store Size")
plt.ylabel("Product Type")
plt.tight_layout()
plt.show()
pivot
| Store_Size | High | Medium | Small |
|---|---|---|---|
| Product_Type | |||
| Baking Goods | 3861.257647 | 3632.219959 | 1769.380208 |
| Breads | 4042.469667 | 3743.188299 | 1887.803043 |
| Breakfast | 3816.110000 | 3710.780370 | 1559.740000 |
| Canned | 3773.246891 | 3616.167383 | 1721.223409 |
| Dairy | 3991.784133 | 3753.251365 | 1720.078654 |
| Frozen Foods | 3933.498662 | 3646.352236 | 1785.108416 |
| Fruits and Vegetables | 3984.887387 | 3638.704218 | 1776.806905 |
| Hard Drinks | 4024.230000 | 3547.559576 | 1809.395000 |
| Health and Hygiene | 3815.836053 | 3697.496667 | 1809.459451 |
| Household | 3965.458060 | 3653.563518 | 1846.656500 |
| Meat | 3891.286769 | 3669.687406 | 1744.827701 |
| Others | 3999.260968 | 3808.747327 | 1728.196316 |
| Seafood | 4072.064615 | 3807.619811 | 1766.335000 |
| Snack Foods | 3990.803168 | 3654.852859 | 1748.750479 |
| Soft Drinks | 3873.100849 | 3654.380855 | 1674.328226 |
| Starchy Foods | 3763.874375 | 3899.850412 | 1670.415000 |
avg_sales
Store_Size High 3923.778802 Medium 3668.222573 Small 1762.942465 Name: Product_Store_Sales_Total, dtype: float64
⚡ Insights
# Pivot table for average sales per Product_Type and Store_Type
pivot = df.pivot_table(
index='Product_Type',
columns='Store_Type',
values='Product_Store_Sales_Total',
aggfunc='mean'
)
# Plot: Grouped bar
pivot.plot(kind='bar', figsize=(14, 6), width=0.8, colormap='crest')
plt.title('Average Sales by Product Type across Store Types')
plt.ylabel('Average Sales (₹)')
plt.xlabel('Product Type')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.legend(title='Store Type')
plt.tight_layout()
plt.show()
# Same pivot as above
sns.heatmap(pivot, annot=True, fmt=".0f", cmap='flare', linewidths=0.5)
plt.title('Heatmap of Average Sales: Product vs Store Type')
plt.xlabel('Store Type')
plt.ylabel('Product Type')
plt.tight_layout()
plt.show()
# Summary statistics
print("Top 3 combinations by sales:")
combinations = df.groupby(['Product_Type', 'Store_Type'])['Product_Store_Sales_Total'].mean().sort_values(ascending=False)
print(combinations.head(3).round(0))
Top 3 combinations by sales: Product_Type Store_Type Breads Departmental Store 5159.0 Starchy Foods Departmental Store 5126.0 Breakfast Departmental Store 5033.0 Name: Product_Store_Sales_Total, dtype: float64
pivot
| Store_Type | Departmental Store | Food Mart | Supermarket Type1 | Supermarket Type2 |
|---|---|---|---|---|
| Product_Type | ||||
| Baking Goods | 4968.769697 | 1769.380208 | 3861.257647 | 3288.535740 |
| Breads | 5158.586176 | 1887.803043 | 4042.469667 | 3317.316372 |
| Breakfast | 5033.372632 | 1559.740000 | 3816.110000 | 3305.469839 |
| Canned | 5027.168556 | 1721.223409 | 3773.246891 | 3281.982895 |
| Dairy | 4936.654759 | 1720.078654 | 3991.784133 | 3321.025945 |
| Frozen Foods | 4898.429672 | 1785.108416 | 3933.498662 | 3303.855717 |
| Fruits and Vegetables | 4930.975055 | 1776.806905 | 3984.887387 | 3302.713800 |
| Hard Drinks | 4815.665217 | 1809.395000 | 4024.230000 | 3240.544526 |
| Health and Hygiene | 4934.148090 | 1809.459451 | 3815.836053 | 3367.969790 |
| Household | 4897.024673 | 1846.656500 | 3965.458060 | 3320.104010 |
| Meat | 4914.525283 | 1744.827701 | 3891.286769 | 3222.389729 |
| Others | 4998.867188 | 1728.196316 | 3999.260968 | 3256.807681 |
| Seafood | 5025.960000 | 1766.335000 | 4072.064615 | 3411.659250 |
| Snack Foods | 4938.228172 | 1748.750479 | 3990.803168 | 3266.710081 |
| Soft Drinks | 4933.058108 | 1674.328226 | 3873.100849 | 3312.784765 |
| Starchy Foods | 5126.378571 | 1670.415000 | 3763.874375 | 3402.128841 |
⚡ Insights
Departmental Store dominance: Departmental Stores dominate sales across all product types, indicating stronger customer engagement or product availability.
Staple foods excel in premium formats: Breads, Starchy Foods, and Breakfast items perform exceptionally well in Departmental Stores, likely driven by routine or impulse buys; this shows essential items thrive in full-service environments
Food Mart underperformance: Food Marts lag behind in sales for most categories, suggesting possible limitations in reach, assortment, or store scale.
Strategic insight: Expanding Departmental Store format could boost performance across entire product portfolio rather than optimizing specific product-store matches
Format-agnostic performance: Similar relative product rankings across store types suggest product characteristics matter more than format-specific synergies
🎯 Decision Flow
🔗 Pipeline Order
Transformation vs Scaling - Key Distinction:
Transformation = Changes distribution shape
Scaling = Changes range/magnitude but preserves shape
Simple Example:
Product_Weight: [8, 12, 16] kg
Product_MRP: [100, 150, 200] rupees
Without scaling: Linear regression might favor MRP (larger numbers)
With scaling: Both features get equal importance consideration
MinMaxScaler maintains relative distances (1 → 2 → 3 becomes 0 → 0.5 → 1)
StandardScaler maintains order too (but re-centers and re-spreads)
➡️ For linear models, StandardScaler is preferred due to better optimization behavior, even for encoded ordinal features.
NOTE:
If the values imply order, scale them.
If the values are just binary flags, don’t scale.
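The transformation-vs-scaling distinction above can be seen on a tiny example (toy values, not dataset columns):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0]])

mm = MinMaxScaler().fit_transform(x)    # range squeezed into [0, 1], relative spacing kept
ss = StandardScaler().fit_transform(x)  # re-centered to mean 0, unit variance

print(mm.ravel())  # [0.  0.5 1. ]
print(ss.ravel())
```

Both preserve the order and shape of the data; only the range and center change, which is exactly why neither fixes skewness.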
GOOD Practices (TIPS)
Keeps the pipeline uniform and consistent
Prevents this feature from being treated on a different scale than the rest
Avoids manual branching (i.e., "scale this numeric but not that one")
Scaling in Linear Regression: Why It's Almost Always Needed
Gradient Descent Behavior:
When to Scale/Transform
Memory Hook "Features with larger raw values get artificial loudness in the loss function - scaling makes everyone speak at the same volume."
📌 NOTE:
Classification algorithms:
Linear Regression:
In addition 👀
"Imbalance treatment" (like resampling or class weighting) is typically required for target variable imbalance in classification tasks — not for feature columns.
For categorical feature columns, it's natural for some categories to dominate. That's simply part of the data distribution.
📌 Point in Question:
“Due to high cardinality, encoding options like target encoding or learned embeddings may be more effective than simple one-hot encoding.”
Generally
✅ One-Hot Encoding (OHE) → Preferred for Linear Models (like Linear Regression), since they assume linearity and independence between features.
✅ Target Encoding → Works well with Tree-based models (like Random Forest, XGBoost), since they can handle ordinal relationships and benefit from the signal in the target.
💡 OHE can hurt tree models by increasing dimensionality unnecessarily, while target encoding can leak target information into training unless the mapping is computed out-of-fold.
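A naive mean-target encoding looks like this (toy data; in practice the mapping should be learned on the training split only, or out-of-fold, to limit the leakage mentioned above):

```python
import pandas as pd

df_toy = pd.DataFrame({
    'Product_Type': ['Dairy', 'Dairy', 'Seafood', 'Seafood', 'Dairy'],
    'Sales':        [100.0,   120.0,   300.0,     340.0,     110.0],
})

# Category -> mean of the target, learned from the (toy) training data
mapping = df_toy.groupby('Product_Type')['Sales'].mean()
df_toy['Product_Type_te'] = df_toy['Product_Type'].map(mapping)

print(mapping.to_dict())  # {'Dairy': 110.0, 'Seafood': 320.0}
```

Unseen categories at prediction time map to NaN here, so a fallback (e.g. the global target mean) is usually added.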
# from ydata_profiling import ProfileReport
# # Quick outlier summary
# profile = ProfileReport(df, title="Outlier & Data Profile", explorative=True)
# profile.to_notebook_iframe() # or profile.to_file("report.html")
🤔 Why IQR over Z-score:
def detect_outliers_iqr(df):
"""
Detect outliers using IQR method (consistent with our EDA approach)
"""
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
outlier_summary = []
for col in numerical_cols:
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
outlier_count = len(outliers)
outlier_percentage = (outlier_count / len(df)) * 100
outlier_summary.append({
'Column': col,
'Outlier_Count': outlier_count,
'Outlier_Percentage': round(outlier_percentage, 2)
})
return pd.DataFrame(outlier_summary)
# Quick outlier check
outlier_results = detect_outliers_iqr(df)
print(outlier_results)
Column Outlier_Count Outlier_Percentage 0 Product_Weight 54 0.62 1 Product_Allocated_Area 104 1.19 2 Product_MRP 57 0.65 3 Store_Establishment_Year 0 0.00 4 Product_Store_Sales_Total 119 1.36
📌 Rule of thumb: If outliers < 5% of data, skewness between -1 and 1, and reasonable scale → no target preprocessing needed!
🔍 Observations
📊Key Observations:
Minimal outlier presence: All features show <1.5% outliers, indicating high data quality across SuperKart's dataset
Target variable outliers: 119 outliers (1.36%) in sales data likely represent legitimate high-performing product-store combinations, not data errors
Store_Establishment_Year clean: Zero outliers confirm only 4 establishment years with consistent data entry
Treatment recommendation: Keep outliers as they represent valid business scenarios (premium products, exceptional performance) rather than data quality issues
Preprocessing decision: No outlier treatment needed - low percentages suggest authentic extreme values rather than measurement errors
No outlier treatment is needed for the target variable (Product_Store_Sales_Total) because:
We will build and evaluate two models:
In this phase, we will:
ColumnTransformer. The final model pipelines (preprocessing + estimator) will be built during the modeling phase.
🧠 REMEMBER
🔁 Typical pipeline order:
Logical rationale:
Key insight: Transform before scaling because you want to scale the final transformed distribution, not transform already-scaled values.
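A minimal sketch of this ordering on toy data (log1p chosen as the transform; the data is illustrative): transforming first means the scaler standardizes the already-transformed distribution.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical right-skewed feature
X = np.random.default_rng(0).lognormal(size=(1000, 1))

# Transform first, then scale the transformed distribution
pipe = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])
X_out = pipe.fit_transform(X)
# X_out now has mean ≈ 0 and std ≈ 1 on the log scale
```

If the order were reversed, the log transform would be applied to already-centered (partly negative) values, which is both numerically fragile and statistically meaningless.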
Also for Target Column:
Interpretation Guideline (general - numeric) 🚀
| Question | Indicator |
|---|---|
| Scaling needed? | For numeric with large range or skew - Yes |
| Log transform? | If skew > 1 or < -1 |
| Outliers? | If outliers > 2–5% |
| Encoding? | Categorical with low cardinality → OneHot; ordinal → OrdinalEncoder |
Is it skewed? (|skew| > 1)
Are there outliers? (>5% extreme values)
🧠 REMEMBER
Encoded Categorical Features:
True Numerical Features:
Rule of Thumb:
One-hot encoded → Leave alone (just 0s and 1s)
Ordinal encoded → Maybe scale, but don't overthink it
Target encoded → Treat more like numerical
True numerical → Full preprocessing pipeline
Key insight: Don't lose sight of the original data nature just because encoding made it numeric!
def summarize_numeric_column(df, col):
    series = df[col].dropna()
    skewness = skew(series)
    outliers = (abs(zscore(series)) > 3).sum()
    outlier_pct = (outliers / len(series)) * 100
    print(f"\n📈 Column: {col}")
    print(f"Skewness: {skewness:.2f}")
    print(f"Outlier Count (Z > 3): {outliers} ({outlier_pct:.2f}%)")
    print(f"Min: {series.min()}, Max: {series.max()}")
    print(f"Mean: {series.mean():.2f}, Std: {series.std():.2f}")

def summarize_categorical_column(df, col):
    counts = df[col].value_counts()
    percentages = df[col].value_counts(normalize=True) * 100
    summary = pd.DataFrame({
        'Count': counts,
        'Percentage (%)': percentages.round(2)
    })
    imbalance_ratio = counts.max() / counts.min() if counts.min() > 0 else float('inf')
    print(f"\n🪪 Column: {col}")
    print(f"Unique Categories: {df[col].nunique()}")
    print(f"Imbalance Ratio (max/min): {imbalance_ratio:.2f}")
    print(summary)
Interpretation Guideline
| Metric | When to Act |
|---|---|
| Skewness > ±1 | Consider log or power transformation |
| Outlier % > 2–5% | Consider treatment (cap, impute, remove) |
| Imbalance Ratio > 4 | May affect encoding; be cautious |
| Unique Categories = 2–5 | OneHot/Ordinal (case-by-case) |
| Unique Categories > 10 | High cardinality — use frequency/target encoding if needed (for tree models only) |
💡 Z-Score Interpretation
NOTE:
As a general guideline:
So, Z-score is best used when skewness is close to 0 (i.e., data is approximately normal).
Z-Score assumes a normal distribution. For skewed data, use the IQR method instead.
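A quick illustration of how the two methods disagree on skewed data (synthetic lognormal sample; numbers are illustrative): the Z-score threshold is inflated by the heavy tail, so it flags far fewer points than the IQR rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
skewed = pd.Series(rng.lognormal(mean=0, sigma=1, size=5000))  # right-skewed

# Z-score rule (assumes normality)
z = (skewed - skewed.mean()) / skewed.std()
z_outliers = int((z.abs() > 3).sum())

# IQR rule (distribution-free)
q1, q3 = skewed.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = int(((skewed < q1 - 1.5 * iqr) | (skewed > q3 + 1.5 * iqr)).sum())

print(f"Z-score flags: {z_outliers}, IQR flags: {iqr_outliers}")
```

On this sample the IQR method flags several times more points, which is why the EDA above sticks with IQR for skewed features.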
summarize_numeric_column(df, 'Product_Weight')
📈 Column: Product_Weight
Skewness: 0.02
Outlier Count (Z > 3): 24 (0.27%)
Min: 4.0, Max: 22.0
Mean: 12.65, Std: 2.22
summarize_categorical_column(df, 'Product_Sugar_Content')
🪪 Column: Product_Sugar_Content
Unique Categories: 4
Imbalance Ratio (max/min): 45.23
Count Percentage (%)
Product_Sugar_Content
Low Sugar 4885 55.75
Regular 2251 25.69
No Sugar 1519 17.33
reg 108 1.23
reg is most likely a mislabel of Regular, so we need to correct it during cleaning.
summarize_numeric_column(df, 'Product_Allocated_Area')
📈 Column: Product_Allocated_Area
Skewness: 1.13
Outlier Count (Z > 3): 75 (0.86%)
Min: 0.004, Max: 0.298
Mean: 0.07, Std: 0.05
PowerTransformer already standardizes the output by default (standardize=True), so additional scaling is not needed for this column either!
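A quick check of that default on synthetic skewed data (illustrative only): after fitting, the transformed column already has zero mean and unit variance, so a separate scaler would be redundant.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed feature
X = np.random.default_rng(0).lognormal(size=(1000, 1))

pt = PowerTransformer()  # method='yeo-johnson', standardize=True by default
X_t = pt.fit_transform(X)
# Transformed output is already standardized: mean ≈ 0, std ≈ 1
```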
summarize_categorical_column(df, 'Product_Type')
🪪 Column: Product_Type
Unique Categories: 16
Imbalance Ratio (max/min): 16.43
Count Percentage (%)
Product_Type
Fruits and Vegetables 1249 14.25
Snack Foods 1149 13.11
Frozen Foods 811 9.25
Dairy 796 9.08
Household 740 8.44
Baking Goods 716 8.17
Canned 677 7.73
Health and Hygiene 628 7.17
Meat 618 7.05
Soft Drinks 519 5.92
Breads 200 2.28
Hard Drinks 186 2.12
Others 151 1.72
Starchy Foods 141 1.61
Breakfast 106 1.21
Seafood 76 0.87
df.groupby("Product_Type")["Product_Store_Sales_Total"].mean()
Product_Type
Baking Goods             3425.958101
Breads                   3574.711200
Breakfast                3416.324623
Canned                   3397.463383
Dairy                    3532.560352
Frozen Foods             3464.834562
Fruits and Vegetables    3443.421353
Hard Drinks              3364.594731
Health and Hygiene       3445.393646
Household                3465.865095
Meat                     3445.326764
Others                   3586.068212
Seafood                  3584.263684
Snack Foods              3471.711880
Soft Drinks              3462.513911
Starchy Foods            3679.251418
Name: Product_Store_Sales_Total, dtype: float64
Observation 👀
Reckoning and Insights 🧐
Grouping Challenges:
Statistical reality check:
If we consider domain grouping as
Hence for now,
🥇 Most Viable: Mapping to Others
NOTE:
However, for more nuance we can also try the below (ie introduce 2 new groups):
So if the model does not perform well, this feature can be regrouped for better results!
# Check sales similarity for categories being merged into "Others"
def analyze_rare_category_grouping(df):
    """
    Analyze the impact of merging rare categories into 'Others'
    """
    # Define categories to merge
    rare_categories = ['Breads', 'Hard Drinks', 'Starchy Foods', 'Breakfast', 'Seafood']

    # Current "Others" performance
    others_current = df[df['Product_Type'] == 'Others']['Product_Store_Sales_Total']
    print("📊 CURRENT 'OTHERS' CATEGORY")
    print(f"Count: {len(others_current)}")
    print(f"Average Sales: {others_current.mean():.0f}")

    # Rare categories individual performance
    print("\n📋 RARE CATEGORIES TO MERGE")
    rare_stats = []
    total_count = 0
    total_sales = 0
    for category in rare_categories:
        cat_data = df[df['Product_Type'] == category]['Product_Store_Sales_Total']
        avg_sales = cat_data.mean()
        count = len(cat_data)
        rare_stats.append((category, count, avg_sales))
        total_count += count
        total_sales += cat_data.sum()
        print(f"{category}: {count} samples, avg = {avg_sales:.0f}")

    # Combined "Others" after merging
    print("\n🔄 AFTER MERGING")
    new_others_count = len(others_current) + total_count
    new_others_avg = (others_current.sum() + total_sales) / new_others_count
    print(f"New 'Others' Count: {new_others_count}")
    print(f"New 'Others' Average: {new_others_avg:.0f}")

    # Impact analysis
    print("\n📈 IMPACT ANALYSIS")
    print(f"Change in 'Others' average: {new_others_avg - others_current.mean():.0f}")
    print(f"Range of rare categories: {min(x[2] for x in rare_stats):.0f} - {max(x[2] for x in rare_stats):.0f}")

    return new_others_avg, rare_stats

# Run analysis
new_avg, rare_stats = analyze_rare_category_grouping(df)
📊 CURRENT 'OTHERS' CATEGORY
Count: 151
Average Sales: 3586

📋 RARE CATEGORIES TO MERGE
Breads: 200 samples, avg = 3575
Hard Drinks: 186 samples, avg = 3365
Starchy Foods: 141 samples, avg = 3679
Breakfast: 106 samples, avg = 3416
Seafood: 76 samples, avg = 3584

🔄 AFTER MERGING
New 'Others' Count: 860
New 'Others' Average: 3530

📈 IMPACT ANALYSIS
Change in 'Others' average: -56
Range of rare categories: 3365 - 3679
👀 Merging creates a robust, well-balanced category without extreme distortions.
Hence, introduce a new pipeline step for this grouping, e.g.
group_rare_labels
summarize_numeric_column(df, 'Product_MRP')
📈 Column: Product_MRP
Skewness: 0.04
Outlier Count (Z > 3): 25 (0.29%)
Min: 31.0, Max: 266.0
Mean: 147.03, Std: 30.69
Decision: Use raw values + scaling only.
summarize_categorical_column(df, 'Store_Establishment_Year')
🪪 Column: Store_Establishment_Year
Unique Categories: 4
Imbalance Ratio (max/min): 4.06
Count Percentage (%)
Store_Establishment_Year
2009 4676 53.36
1987 1586 18.10
1999 1349 15.39
1998 1152 13.15
🔍 Observations
With 4 discrete values:
✅ Best Choice for Store_Age: MinMaxScaler
Why?
summarize_categorical_column(df, 'Store_Size')
🪪 Column: Store_Size
Unique Categories: 3
Imbalance Ratio (max/min): 5.23
Count Percentage (%)
Store_Size
Medium 6025 68.75
High 1586 18.10
Small 1152 13.15
summarize_categorical_column(df, 'Store_Location_City_Type')
🪪 Column: Store_Location_City_Type
Unique Categories: 3
Imbalance Ratio (max/min): 5.44
Count Percentage (%)
Store_Location_City_Type
Tier 2 6262 71.46
Tier 1 1349 15.39
Tier 3 1152 13.15
NOTE: ✅
Even though Store_Location_City_Type is categorical in origin, here's why scaling is needed in the context of a linear model:
⚠️ Reason: Effect on Linear Model Coefficients
Once you apply ordinal encoding (Tier 1 → 1, Tier 2 → 2, Tier 3 → 3),
it becomes a numeric feature — the model will treat it like a real-valued input.
Linear models interpret magnitude and scale literally - a larger value means more influence unless scaled.
🧠 Key Points:
Tier 3 (3) vs Tier 1 (1) carries 3x the weight in raw encoded form - linear models treat this magnitude as meaningful unless corrected by scaling. So, despite the feature’s categorical origin, scaling is needed post-encoding for:
This is especially important when using regularization (like in Ridge or Lasso).
💡 Scaling reduces feature dominance and improves convergence.
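A minimal sketch of this post-encoding scaling, assuming the Tier 1 < Tier 2 < Tier 3 ordering described above (the tiny DataFrame is illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

df = pd.DataFrame({"Store_Location_City_Type": ["Tier 1", "Tier 2", "Tier 3", "Tier 2"]})

# Encode the tier order explicitly, THEN scale the resulting numeric column
tier_pipe = Pipeline([
    ("ordinal", OrdinalEncoder(categories=[["Tier 1", "Tier 2", "Tier 3"]])),
    ("scale", StandardScaler()),
])
ct = ColumnTransformer([("tier", tier_pipe, ["Store_Location_City_Type"])])
scaled = ct.fit_transform(df).ravel()
# Encoded values 0/1/2/1 become zero-mean, unit-variance after scaling
```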
summarize_categorical_column(df, 'Store_Type')
🪪 Column: Store_Type
Unique Categories: 4
Imbalance Ratio (max/min): 4.06
Count Percentage (%)
Store_Type
Supermarket Type2 4676 53.36
Supermarket Type1 1586 18.10
Departmental Store 1349 15.39
Food Mart 1152 13.15
❗ Problem:
Saving a pipeline with a custom transformer causes an error when loading:
refer: https://stackoverflow.com/questions/49621169/joblib-load-main-attributeerror
AttributeError: Can't get attribute 'MyTransformer' on <module '__main__'>
✅ Why:
The class was defined in the notebook, so pickle records its path as `__main__.MyTransformer`, which does not exist in the process that loads the file.
✅ Best Practice:
🔁 TL;DR: Points to Remember
Define custom transformer classes in their own module file, not in `__main__` or the notebook. Instead of the hacky `__main__` reassignment, having the class in its own file with proper imports will ensure consistent module paths during both serialization and deserialization.
Eg of the hacky assignment => `__main__.ClassName = ClassName  # so the right class is reachable in the local project (eg backend)`
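A self-contained sketch of the fix, using a throwaway module file (`mytransformers.py` and `MyTransformer` are hypothetical names): because the class lives in an importable module, the pickled object records a stable path instead of `__main__`.

```python
import importlib
import pathlib

import joblib

# Write the transformer into its own module file, NOT into __main__
pathlib.Path("mytransformers.py").write_text(
    "from sklearn.base import BaseEstimator, TransformerMixin\n"
    "class MyTransformer(BaseEstimator, TransformerMixin):\n"
    "    def fit(self, X, y=None): return self\n"
    "    def transform(self, X): return X\n"
)
importlib.invalidate_caches()

from mytransformers import MyTransformer  # pickled path: mytransformers.MyTransformer

joblib.dump(MyTransformer(), "t.joblib")
obj = joblib.load("t.joblib")  # resolves via 'mytransformers', not '__main__'
print(type(obj).__module__)  # → mytransformers
```

Any process that loads `t.joblib` (e.g. a backend service) just needs the same module on its `sys.path`.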
⚠️ The column ordering issue with CatBoost.
While pandas DataFrames are flexible with column access by name, the underlying CatBoost model expects features in the exact same order as it was trained with. This is a common gotcha when working with many ML models, not just CatBoost, since they ultimately work with numpy arrays where position matters more than labels.
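One defensive pattern is to re-align incoming columns by name against the training-time order before calling predict. A sketch with hypothetical column names (no CatBoost dependency; pandas `reindex` does the alignment):

```python
import pandas as pd

# Hypothetical training-time column order
expected_cols = ["Product_Weight", "Product_MRP", "Store_Age"]

# Inference payload arrives with columns in a different order
X_new = pd.DataFrame({"Store_Age": [16], "Product_Weight": [12.5], "Product_MRP": [150]})

# Re-align by name BEFORE handing the frame to the model; a missing
# column surfaces as a NaN column here instead of a silent positional mix-up
X_aligned = X_new.reindex(columns=expected_cols)
print(list(X_aligned.columns))  # → ['Product_Weight', 'Product_MRP', 'Store_Age']
```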
📝 Lesson: Column Order Matters in ML Pipelines (Even with Column Names!)
🔍 Root Cause
✨ Tips
General Rule
General Rule
get_feature_names_out() should ONLY return feature column names.
Reasoning
Key Points
The names returned by get_feature_names_out() must match the columns that transform() actually produces.
Pipeline Context
Remember: get_feature_names_out() is about feature tracking, not target tracking. The target variable is handled separately in the ML pipeline.
Flow (Pipeline)
Pipeline.fit(X, y)
For Transformers in the pipeline:
- y is passed to their fit() method (as an optional parameter)
- transform() only gets X
- X flows to the next step
pipeline.fit(X, y):
Step 1 (Transformer):
- fit(X, y) # y available but usually unused
- transform(X) # only X, produces X1
Step 2 (Transformer):
- fit(X1, y) # y still original
- transform(X1) # only X1, produces X2
Final Step (Model):
- fit(X2, y) # model gets transformed X2 and original y
Why this
- y remains pristine for final model training
This is why transformers are designed to work primarily with X, even though y is technically available in their fit() method.
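This flow can be verified with a tiny spy transformer (illustrative, not part of the project code): it records that `fit` receives `y` while `transform` never does.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

calls = []  # records what each method actually receives

class SpyTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        calls.append(("fit", y is not None))  # y IS available in fit
        return self

    def transform(self, X):
        calls.append(("transform",))  # y is never passed here
        return X

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel()

Pipeline([("spy", SpyTransformer()), ("model", LinearRegression())]).fit(X, y)
print(calls)  # → [('fit', True), ('transform',)]
```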
# make dir for preprocessing helper functions
os.makedirs('preprocessing', exist_ok=True)
1. Data Prepper helper file
%%writefile preprocessing/custom_data_prepper.py
from sklearn.base import BaseEstimator, TransformerMixin
from datetime import datetime
import numpy as np
class CustomDataPrepper(BaseEstimator, TransformerMixin):
    """
    Custom transformer for preparing data in the SuperKart model pipeline
    that performs:
    a. data cleaning
    b. data preparation (ie before feature engineering)
    ---
    For now:
    - Product_Sugar_Content: 'reg' → 'Regular'
    - Store_Establishment_Year <-> Store_Age
    - Unify column order
    - Drop nominal (ID) columns
    ---
    NOTE:
    - This will work only on DataFrames (ie expects named columns to exist)
    """
    # Constants for column names
    STORE_AGE_COL = 'Store_Age'

    # Define expected column order (for predictors)
    EXPECTED_X_COLUMNS = [
        'Product_Weight',
        'Product_Sugar_Content',
        'Product_Allocated_Area',
        'Product_Type',
        'Product_MRP',
        'Store_Size',
        'Store_Location_City_Type',
        'Store_Type',
        'Store_Age',  # transformed from Store_Establishment_Year
        # 'Product_Store_Sales_Total'  # target variable
    ]

    def __init__(self):
        """Initialize the data cleaner"""
        pass

    def fit(self, X, y=None):
        """
        Fit method (required for sklearn compatibility).
        No fitting needed for data cleaning operations.
        """
        return self

    def transform(self, X):
        """
        Apply data cleaning transformations
        Args:
            X: DataFrame to clean
        Returns:
            DataFrame with cleaned data
        """
        # Create a copy to avoid modifying original data
        X_cleaned = X.copy()

        # 1. Fix Product_Sugar_Content inconsistency (reg → Regular)
        if 'Product_Sugar_Content' in X_cleaned.columns:
            X_cleaned['Product_Sugar_Content'] = X_cleaned['Product_Sugar_Content'].replace(
                'reg', 'Regular'
            )

        # 2. Store_Establishment_Year -> Store_Age
        if 'Store_Establishment_Year' in X_cleaned.columns:
            current_year = datetime.now().year
            X_cleaned[self.STORE_AGE_COL] = np.maximum(current_year - X_cleaned['Store_Establishment_Year'], 0)
            # drop original column
            X_cleaned = X_cleaned.drop('Store_Establishment_Year', axis=1)

        # 3. Drop nominal columns
        X_cleaned = self.drop_columns(X_cleaned)

        # 4. Unify column order (MIMP <--)
        # ---
        # !!
        # Column order is critical for CatBoostRegressor as it relies on column positions.
        # If column order differs between training and inference, wrong features will be
        # used, leading to incorrect predictions.
        # ---
        # available_expected_cols = [col for col in self.EXPECTED_X_COLUMNS if col in X_cleaned.columns]
        X_cleaned = X_cleaned[self.EXPECTED_X_COLUMNS]

        # Add other data cleaning rules here as needed
        # Example: fix whitespace, case issues, etc.
        return X_cleaned

    def get_feature_names_out(self, input_features=None):
        """
        Return feature names for output features.
        Required for sklearn compatibility.
        IMPORTANT:
        - This transformer STRICTLY expects a DataFrame with specific columns
        - We intentionally return EXPECTED_X_COLUMNS as transform() ensures this structure
        - If input structure doesn't match, transform() will fail earlier with KeyError
        - This strict behavior is INTENDED to catch data structure mismatches early
        Returns:
            list: EXPECTED_X_COLUMNS - the fixed ordered list of feature names
                  that transform() guarantees to output
        """
        # if input_features is None:
        #     return input_features
        # # Create list of output features
        # output_features = list(input_features)
        # # Handle Store_Establishment_Year -> Store_Age transformation
        # if 'Store_Establishment_Year' in output_features:
        #     output_features.remove('Store_Establishment_Year')
        #     output_features.append(self.STORE_AGE_COL)
        # return output_features
        # Now we can simply return EXPECTED_X_COLUMNS, as it's ensured in the last step of transform()
        return self.EXPECTED_X_COLUMNS

    def drop_columns(self, X):
        cols_to_drop = ['Product_Id', 'Store_Id']
        existing_cols = [col for col in cols_to_drop if col in X.columns]
        return X.drop(columns=existing_cols)
Overwriting preprocessing/custom_data_prepper.py
# !! This Cell was used just for testing !!
# # As earlier we wrote file to directory we can now import and use it here
# from preprocessing.custom_data_prepper import CustomDataPrepper
2. Debug Transformer Helper
%%writefile preprocessing/debug_transformer.py
from sklearn.base import BaseEstimator, TransformerMixin
'''
NOTE:
ENABLE_DEBUG: decides whether the debugging feature is on or off (global config)
[active]: individual instance config (used esp. to retain config during serialization)
'''

class DebugTransformer(BaseEstimator, TransformerMixin):
    # This is a custom transformer to debug the data after preprocessing

    # This is global config
    ENABLE_DEBUG = False

    def __init__(
        self,
        message="Debug Step",
        show_head=False,
        n_rows=2,
        active=True,
    ):
        self.message = message
        self.show_head = show_head
        self.n_rows = n_rows
        self.active = active

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not self.ENABLE_DEBUG:  # Global config
            return X
        if not self.active:  # Individual instance/local config
            return X
        print(f"\n🪵 {self.message} ---- (D)")
        if hasattr(X, "columns"):
            # It's a DataFrame
            print(f"Shape: {X.shape}")
            print(f"Columns: {list(X.columns)}")
            if self.show_head:
                print(f"First {self.n_rows} rows:")
                print(X.head(self.n_rows))
        else:
            # It's a numpy array
            print("Data type: numpy array")
            if self.show_head:
                print("Sample row:")
                print(X[: self.n_rows])  # First n_rows rows of the numpy array
        return X

    def get_feature_names_out(self, input_features=None):
        return input_features

    def turn_on(self):
        '''Turn on the debug feature for this instance'''
        self.active = True

    def turn_off(self):
        '''Turn off the debug feature for this instance'''
        self.active = False
Overwriting preprocessing/debug_transformer.py
3. index file for preprocessing folder
%%writefile preprocessing/__init__.py
"""Preprocessing module for data preparation and debugging transformers."""
from .custom_data_prepper import CustomDataPrepper
from .debug_transformer import DebugTransformer
__all__ = [
'CustomDataPrepper',
'DebugTransformer',
]
# # Set default debug mode
# DebugTransformer.ENABLE_DEBUG = False
Overwriting preprocessing/__init__.py
✅ Now our preprocessing helpers are ready!
from preprocessing import DebugTransformer, CustomDataPrepper
⚡ NOTE: Remember to copy the same
preprocessing folder to the backend so that the deserialized model can find the classes correctly.
def test_custom_data_prepper():
    """
    Test the CustomDataPrepper with various scenarios including:
    1. Basic transformations (Sugar Content, Store Age)
    2. Column order enforcement
    3. ID columns dropping
    4. Error cases for missing required columns
    """
    # Create sample data with all expected columns
    test_data = pd.DataFrame({
        # Required columns as per EXPECTED_X_COLUMNS
        'Product_Weight': [100, 200, 150, 300, 250, 180],
        'Product_Sugar_Content': ['Regular', 'reg', 'Low Sugar', 'No Sugar', 'reg', 'Low Sugar'],
        'Product_Allocated_Area': [10, 15, 20, 25, 30, 35],
        'Product_Type': ['Dairy', 'Breads', 'Snack Foods', 'Seafood', 'Others', 'Breakfast'],
        'Product_MRP': [50, 30, 25, 45, 35, 40],
        'Store_Size': ['Small', 'Medium', 'Large', 'Medium', 'Small', 'Large'],
        'Store_Location_City_Type': ['Tier 1', 'Tier 2', 'Tier 1', 'Tier 3', 'Tier 2', 'Tier 1'],
        'Store_Type': ['Grocery Store', 'Supermarket', 'Grocery Store', 'Supermarket', 'Grocery Store', 'Supermarket'],
        'Store_Establishment_Year': [1985, 1990, 1995, 2000, 2025, 1987],  # Will be transformed to Store_Age
        # Additional columns that should be dropped
        'Product_Id': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6'],
        'Store_Id': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6'],
    })

    # Test the transformer
    prepper = CustomDataPrepper()
    prepper.fit(test_data)
    cleaned_data = prepper.transform(test_data)

    # 1. Test Sugar Content standardization
    print("=== Product_Sugar_Content Cleaning ===")
    print("Before:")
    print(test_data['Product_Sugar_Content'].value_counts())
    print("\nAfter:")
    print(cleaned_data['Product_Sugar_Content'].value_counts())

    # 2. Test Store Age transformation
    current_year = datetime.now().year
    print("\n=== Store_Establishment_Year → Store_Age ===")
    print("Before (Establishment Year):")
    print(test_data['Store_Establishment_Year'].describe())
    print("\nAfter (Store Age):")
    print(cleaned_data['Store_Age'].describe())

    # 3. Test column order
    print("\n=== Column Order Verification ===")
    print("Expected columns:", prepper.EXPECTED_X_COLUMNS)
    print("Actual columns:", cleaned_data.columns.tolist())
    assert list(cleaned_data.columns) == prepper.EXPECTED_X_COLUMNS, "Column order mismatch!"

    # 4. Test ID columns dropping
    print("\n=== ID Columns Dropping ===")
    print("Dropped columns:", set(test_data.columns) - set(cleaned_data.columns))
    assert 'Product_Id' not in cleaned_data.columns, "Product_Id should be dropped"
    assert 'Store_Id' not in cleaned_data.columns, "Store_Id should be dropped"

    # 5. Test get_feature_names_out
    print("\n=== Feature Names Output ===")
    output_features = prepper.get_feature_names_out(test_data.columns)
    print("get_feature_names_out result:", output_features)
    assert output_features == prepper.EXPECTED_X_COLUMNS, "Feature names mismatch!"

    # 6. Test error case - missing required columns
    print("\n=== Testing Error Case ===")
    try:
        incomplete_data = test_data.drop(['Product_Weight', 'Store_Size'], axis=1)
        prepper.transform(incomplete_data)
        print("ERROR: Should have raised KeyError for missing columns!")
    except KeyError as e:
        print("Successfully caught error for missing required columns:", e)

    print("\nAll tests completed successfully!")

# Run the tests
test_custom_data_prepper()
=== Product_Sugar_Content Cleaning ===
Before:
Product_Sugar_Content
reg 2
Low Sugar 2
Regular 1
No Sugar 1
Name: count, dtype: int64
After:
Product_Sugar_Content
Regular 3
Low Sugar 2
No Sugar 1
Name: count, dtype: int64
=== Store_Establishment_Year → Store_Age ===
Before (Establishment Year):
count 6.000000
mean 1997.000000
std 14.764823
min 1985.000000
25% 1987.750000
50% 1992.500000
75% 1998.750000
max 2025.000000
Name: Store_Establishment_Year, dtype: float64
After (Store Age):
count 6.000000
mean 28.000000
std 14.764823
min 0.000000
25% 26.250000
50% 32.500000
75% 37.250000
max 40.000000
Name: Store_Age, dtype: float64
=== Column Order Verification ===
Expected columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']
Actual columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']
=== ID Columns Dropping ===
Dropped columns: {'Store_Id', 'Product_Id', 'Store_Establishment_Year'}
=== Feature Names Output ===
get_feature_names_out result: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']
=== Testing Error Case ===
Successfully caught error for missing required columns: "['Product_Weight', 'Store_Size'] not in index"
All tests completed successfully!
🧠 REMEMBER (reload imports)
When developing Python modules in Jupyter notebooks using
%%writefile, changes to the files won't automatically reflect in your notebook due to Python's module caching.
Hence use
importlib.reload(mymodule) # Refresh!
whenever you need the notebook to pick up code changes.
# !! # This cell was used to Test and Tackle changes sync issues due to %%writefile & caching done by python whilst importing libraries
# # Either force reload the module:
# # Otherwise sometimes changes done via `%%writefile` wont sync
# import importlib
# import preprocessing.debug_transformer
# importlib.reload(preprocessing.debug_transformer)
# from preprocessing.debug_transformer import DebugTransformer
def test_debug_transformer():
    """
    Test the DebugTransformer with various scenarios:
    1. DataFrame handling
    2. NumPy array handling
    3. Debug control (global and instance level)
    4. Different configurations
    5. Feature names preservation
    """
    import sys
    from io import StringIO

    from preprocessing.debug_transformer import DebugTransformer

    # Helper function to capture print output
    def capture_output(func):
        # Redirect stdout to capture print statements
        old_stdout = sys.stdout
        output = StringIO()
        sys.stdout = output
        func()
        # Restore stdout
        sys.stdout = old_stdout
        return output.getvalue()

    # Test data
    df = pd.DataFrame({
        'numeric': [1, 2, 3, 4, 5],
        'categorical': ['A', 'B', 'C', 'D', 'E'],
        'float_col': [1.1, 2.2, 3.3, 4.4, 5.5]
    })
    np_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

    print("=== Starting DebugTransformer Tests ===")

    # 1. Test DataFrame handling
    def test_dataframe():
        debug = DebugTransformer(
            message="Testing DataFrame",
            show_head=True,
            n_rows=2
        )
        DebugTransformer.ENABLE_DEBUG = True
        result = debug.transform(df)
        # Verify data unchanged
        assert result.equals(df), "Transform should not modify DataFrame"
        # Verify feature names preserved
        assert list(debug.get_feature_names_out(df.columns)) == list(df.columns)

    print("\nTesting DataFrame handling...")
    output = capture_output(test_dataframe)
    print("DataFrame test output:")
    print(output)

    # 2. Test NumPy array handling
    def test_numpy():
        debug = DebugTransformer(
            message="Testing NumPy Array",
            show_head=True,
            n_rows=2
        )
        DebugTransformer.ENABLE_DEBUG = True
        result = debug.transform(np_array)
        # Verify data unchanged
        assert np.array_equal(result, np_array), "Transform should not modify NumPy array"

    print("\nTesting NumPy array handling...")
    output = capture_output(test_numpy)
    print("NumPy test output:")
    print(output)

    # 3. Test debug control
    def test_debug_control():
        debug = DebugTransformer(message="Debug Control Test")
        # Test global control
        DebugTransformer.ENABLE_DEBUG = False
        output1 = capture_output(lambda: debug.transform(df))
        assert output1 == "", "Should not output when debug disabled"
        # Test instance control
        DebugTransformer.ENABLE_DEBUG = True
        debug.turn_off()
        output2 = capture_output(lambda: debug.transform(df))
        assert output2 == "", "Should not output when instance disabled"

    print("\nTesting debug control...")
    test_debug_control()
    print("Debug control tests passed!")

    # 4. Test different configurations
    def test_configurations():
        # Test with show_head=False
        debug1 = DebugTransformer(
            message="No Head Test",
            show_head=False
        )
        DebugTransformer.ENABLE_DEBUG = True
        output1 = capture_output(lambda: debug1.transform(df))
        assert "First" not in output1, "Should not show head when show_head=False"
        # Test with different n_rows
        debug2 = DebugTransformer(
            message="Custom Rows Test",
            show_head=True,
            n_rows=3
        )
        output2 = capture_output(lambda: debug2.transform(df))
        assert "First 3 rows" in output2, "Should show correct number of rows"

    print("\nTesting different configurations...")
    test_configurations()
    print("Configuration tests passed!")

    # Reset global debug state
    DebugTransformer.ENABLE_DEBUG = False
    print("\n=== All DebugTransformer Tests Completed Successfully! ===")

# Run the tests
test_debug_transformer()
=== Starting DebugTransformer Tests ===

Testing DataFrame handling...
DataFrame test output:

🪵 Testing DataFrame ---- (D)
Shape: (5, 3)
Columns: ['numeric', 'categorical', 'float_col']
First 2 rows:
   numeric categorical  float_col
0        1           A        1.1
1        2           B        2.2

Testing NumPy array handling...
NumPy test output:

🪵 Testing NumPy Array ---- (D)
Data type: numpy array
Sample row:
[[1 2 3]
 [4 5 6]]

Testing debug control...
Debug control tests passed!

Testing different configurations...
Configuration tests passed!

=== All DebugTransformer Tests Completed Successfully! ===
class ModelResultTracker:
    """
    Track and compare results across multiple models
    """
    def __init__(self):
        self.results = []
        self.model_objects = {}

    def add_result(self, model_name, model_type, train_metrics, test_metrics, model_object=None):
        """
        Add model results to tracker
        Args:
            model_name: Name of the model
            model_type: Type (Linear/Tree)
            train_metrics: Training metrics dict
            test_metrics: Testing metrics dict
            model_object: Trained model object (optional)
        """
        result = {
            'Model_Name': model_name,
            'Model_Type': model_type,
            'Train_RMSE': train_metrics['RMSE'],
            'Train_MAE': train_metrics['MAE'],
            'Train_R2': train_metrics['R2'],
            'Train_Adj_R2': train_metrics['Adj_R2'],
            'Train_MAPE': train_metrics['MAPE'],
            'Test_RMSE': test_metrics['RMSE'],
            'Test_MAE': test_metrics['MAE'],
            'Test_R2': test_metrics['R2'],
            'Test_Adj_R2': test_metrics['Adj_R2'],
            'Test_MAPE': test_metrics['MAPE'],
            'Overfitting_Score': train_metrics['R2'] - test_metrics['R2']  # Simple overfitting check
        }
        self.results.append(result)
        if model_object:
            self.model_objects[model_name] = model_object

    def get_results_summary_vertically(self):
        """
        Get summary of all model results with:
        - Metrics as rows
        - Models as columns
        """
        if not self.results:
            print("No results to display!")
            return None
        # Create initial DataFrame
        df_results = pd.DataFrame(self.results)
        # Set Model_Name as index before transposing
        df_results = df_results.set_index('Model_Name')
        # Drop Model_Type as it will look odd among metrics
        df_results = df_results.drop('Model_Type', axis=1)
        # Transpose for better organization
        df_results = df_results.T
        # Optional: sort index to group metrics together
        metric_order = [
            'Train_R2', 'Test_R2',
            'Train_Adj_R2', 'Test_Adj_R2',
            'Train_RMSE', 'Test_RMSE',
            'Train_MAE', 'Test_MAE',
            'Train_MAPE', 'Test_MAPE',
            'Overfitting_Score'
        ]
        df_results = df_results.reindex(metric_order)
        return df_results

    def get_results_summary_horizontally(self):
        """Get summary of all model results"""
        if not self.results:
            print("No results to display!")
            return None
        df_results = pd.DataFrame(self.results)
        return df_results

    def plot_model_comparison(self, metric='Test_RMSE'):
        """
        Plot model comparison for specified metric
        """
        if not self.results:
            print("No results to plot!")
            return
        df_results = pd.DataFrame(self.results)
        plt.figure(figsize=(12, 6))
        # Bar plot for model comparison
        colors = ['skyblue' if 'Regression' in model_type else 'lightcoral'
                  for model_type in df_results['Model_Type']]
        bars = plt.bar(df_results['Model_Name'], df_results[metric], color=colors)
        plt.xlabel('Model')
        plt.ylabel(metric)
        plt.title(f'Model Comparison - {metric}')
        plt.xticks(rotation=45, ha='right')
        # Add value labels on bars
        for bar, value in zip(bars, df_results[metric]):
            plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                     f'{value:.2f}', ha='center', va='bottom')
        # Add legend
        from matplotlib.patches import Patch
        legend_elements = [Patch(facecolor='skyblue', label='Linear Model'),
                           Patch(facecolor='lightcoral', label='Tree Model')]
        plt.legend(handles=legend_elements)
        plt.tight_layout()
        plt.show()

    def get_best_model(self, metric='Test_R2', higher_better=True):
        """
        Get best performing model based on specified metric
        """
        if not self.results:
            print("No results available!")
            return None
        df_results = pd.DataFrame(self.results)
        if higher_better:
            best_idx = df_results[metric].idxmax()
        else:
            best_idx = df_results[metric].idxmin()
        best_model_name = df_results.loc[best_idx, 'Model_Name']
        best_score = df_results.loc[best_idx, metric]
        print(f"🏆 BEST MODEL: {best_model_name}")
        print(f"📊 {metric}: {best_score:.4f}")
        return best_model_name, best_score

def consolidate_categories(X):
    """Function to group rare labels in a categorical column.
    Currently:
    - Product_Type
    NOTE: Since this does not need fit(), it is used as a plain function (ie static logic).
    """
    X_copy = X.copy()
    rare_categories = ['Breads', 'Hard Drinks', 'Starchy Foods', 'Breakfast', 'Seafood']
    X_copy['Product_Type'] = X_copy['Product_Type'].replace(rare_categories, 'Others')
    return X_copy

# # In linear model pipeline only:
# FunctionTransformer(consolidate_categories)
Pipeline Facade Structures (general idea)
Pipeline(steps=[
    ('preprocessor', ColumnTransformer([
        # Encoding / scaling per column group
    ])),
    ('model', LinearRegression or Ridge or ...)
])

Pipeline(steps=[
    ('feature_engineering', CustomTransformer()),  # optional
    ('preprocessing', ColumnTransformer([...])),   # encoding + scaling
    ('feature_selection', SelectKBest(...)),       # optional
    ('model', Ridge())                             # or CatBoost, etc.
])

Pipeline([
    ('preprocessing', ColumnTransformer([
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(), categorical_cols)
    ])),
    ('model', LinearRegression())
])

For Tree based models
Pipeline([
    ('model', CatBoostRegressor(...))
])
NOTE
ColumnTransformer is for:
Note: Since target is continuous, using simple random split instead of stratification.
def create_train_test_split(df, target_col='Product_Store_Sales_Total', test_size=0.2, random_state=42):
    """
    Create train-test split with proper separation
    Args:
        df: Full dataset
        target_col: Target variable name
        test_size: Proportion for test set
        random_state: For reproducibility
    Returns:
        X_train, X_test, y_train, y_test
    """
    # Separate features and target
    X = df.drop(columns=[target_col])
    y = df[target_col]
    print("📊 TRAIN-TEST SPLIT SUMMARY")
    print(f"Total samples: {len(df):,}")
    print(f"Features: {X.shape[1]}")
    print(f"Target: {target_col}")
    print(f"Split ratio: {int((1-test_size)*100)}/{int(test_size*100)}")

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=test_size,
        random_state=random_state,
        shuffle=True
    )
    print("\n📈 SPLIT RESULTS")
    print(f"Training samples: {len(X_train):,} ({len(X_train)/len(df)*100:.1f}%)")
    print(f"Test samples: {len(X_test):,} ({len(X_test)/len(df)*100:.1f}%)")

    # Target distribution check
    print("\n🎯 TARGET DISTRIBUTION")
    print(f"Train - Mean: {y_train.mean():.0f}, Std: {y_train.std():.0f}")
    print(f"Test - Mean: {y_test.mean():.0f}, Std: {y_test.std():.0f}")
    return X_train, X_test, y_train, y_test

# Create split
X_train, X_test, y_train, y_test = create_train_test_split(df)
📊 TRAIN-TEST SPLIT SUMMARY
Total samples: 8,763
Features: 11
Target: Product_Store_Sales_Total
Split ratio: 80/20

📈 SPLIT RESULTS
Training samples: 7,010 (80.0%)
Test samples: 1,753 (20.0%)

🎯 TARGET DISTRIBUTION
Train - Mean: 3465, Std: 1065
Test - Mean: 3461, Std: 1068
Residual plots in regression are equally helpful as accuracy/loss plots in classification, but they serve slightly different diagnostic purposes:
Residual plots: Diagnose model assumptions (linearity, homoscedasticity, normality) and reveal systematic error patterns
Accuracy/Loss plots: Diagnose training dynamics (overfitting, convergence, learning progression)
Both are essential for model validation - residual plots focus more on model assumptions and error structure, while accuracy/loss plots focus more on training behavior and generalization.
Multiple metrics & their perspectives:
They complement each other - no single metric tells the full story.
QQ Plot of Residuals
Evaluation/Gauge of Model
⚡ Train | Test | Validation
GridSearchCV does internal cross-validation on training data for hyperparameter selection, but this is different from your holdout test set.
Three distinct purposes:
The CV validation in GridSearchCV is for model selection, not final model assessment. You still need that clean test set for unbiased final evaluation.
Moreover, the model never sees 100% of the training data at once during tuning — each CV fit trains on only k−1 of the k folds.
Train metrics = model's performance on data it has now fully seen, while CV metrics = average validation performance across folds during tuning.
Train metrics purpose: Diagnose overfitting/underfitting
Train metrics alone are meaningless - their value comes from comparison with test/validation metrics to understand model behavior.
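The train-vs-CV contrast above can be made concrete on synthetic data (the model and dataset below are illustrative stand-ins, not the SuperKart pipeline):

```python
# Sketch: comparing a model's train R² against its cross-validated R² to
# diagnose overfitting -- an inflated train score with a much lower CV score
# is the classic warning sign.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=25.0, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
train_r2 = tree.score(X, y)                        # data fully seen: near-perfect
cv_r2 = cross_val_score(tree, X, y, cv=5).mean()   # held-out folds: noticeably lower

print(f"Train R²: {train_r2:.3f} | CV R²: {cv_r2:.3f}")
```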
class ModelPerformanceAnalyzer:
"""
Comprehensive model performance analysis for regression tasks
"""
@staticmethod
def adj_r2_score(predictors, targets, predictions):
"""Calculate adjusted R-squared"""
r2 = r2_score(targets, predictions)
n = predictors.shape[0]
k = predictors.shape[1]
return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
@staticmethod
def calculate_metrics(model, X, y, dataset_name="Dataset"):
"""
Calculate comprehensive regression metrics
Args:
model: Trained model
X: Features
y: Actual targets
dataset_name: Name for identification
Returns:
Dictionary of metrics
"""
predictions = model.predict(X)
metrics = {
'Dataset': dataset_name,
'RMSE': mean_squared_error(y, predictions, squared=False),  # NOTE: squared=False is removed in newer sklearn; use root_mean_squared_error on sklearn >= 1.4
'MAE': mean_absolute_error(y, predictions),
'R2': r2_score(y, predictions),
'Adj_R2': ModelPerformanceAnalyzer.adj_r2_score(X, y, predictions),
'MAPE': mean_absolute_percentage_error(y, predictions) * 100, # Convert to percentage
'Samples': len(y)
}
return metrics
@staticmethod
def analyze_residuals(model, X, y, title="Residual Analysis"):
"""
Create comprehensive residual analysis plots
"""
predictions = model.predict(X)
residuals = y - predictions
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle(f'{title}', fontsize=16, fontweight='bold')
# 1. Predicted vs Actual
axes[0, 0].scatter(y, predictions, alpha=0.6, color='blue')
axes[0, 0].plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
axes[0, 0].set_xlabel('Actual Values')
axes[0, 0].set_ylabel('Predicted Values')
axes[0, 0].set_title('Predicted vs Actual')
# Add R² on plot
r2 = r2_score(y, predictions)
axes[0, 0].text(0.05, 0.95, f'R² = {r2:.3f}', transform=axes[0, 0].transAxes,
bbox=dict(boxstyle="round", facecolor='white'), fontsize=12)
# 2. Residuals vs Predicted
axes[0, 1].scatter(predictions, residuals, alpha=0.6, color='green')
axes[0, 1].axhline(y=0, color='r', linestyle='--')
axes[0, 1].set_xlabel('Predicted Values')
axes[0, 1].set_ylabel('Residuals')
axes[0, 1].set_title('Residuals vs Predicted')
# 3. Residuals Distribution
axes[1, 0].hist(residuals, bins=30, alpha=0.7, color='orange', edgecolor='black')
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Residual Distribution')
# Add normality test
_, p_value = stats.normaltest(residuals)
axes[1, 0].text(0.05, 0.95, f'Normality p-value: {p_value:.3f}',
transform=axes[1, 0].transAxes,
bbox=dict(boxstyle="round", facecolor='white'), fontsize=10)
# 4. Q-Q Plot
stats.probplot(residuals, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot (Normality Check)')
plt.tight_layout()
plt.show()
return predictions, residuals
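As a quick check of the `adj_r2_score` formula above, plugging in hand-picked numbers:

```python
# Sketch: verifying the adjusted-R² formula used in ModelPerformanceAnalyzer
# on hand-picked numbers: adj_R² = 1 - (1 - R²) * (n - 1) / (n - k - 1).
r2, n, k = 0.80, 100, 10  # R², sample count, feature count (illustrative values)
adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - k - 1))
print(round(adj_r2, 4))   # 0.7775 -- always <= R², with a larger penalty as k grows
```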
def evaluate_model_comprehensive(model, X_train, y_train, X_test, y_test, model_name, model_type, tracker):
"""
Comprehensive model evaluation workflow
"""
analyzer = ModelPerformanceAnalyzer()
# Calculate metrics
train_metrics = analyzer.calculate_metrics(model, X_train, y_train, "Train")
test_metrics = analyzer.calculate_metrics(model, X_test, y_test, "Test")
# Display metrics
print(f"\n📊 {model_name} PERFORMANCE")
print("=" * 50)
print("Training Metrics:")
for key, value in train_metrics.items():
if key != 'Dataset':
print(f" {key}: {value:.4f}")
print("\nTesting Metrics:")
for key, value in test_metrics.items():
if key != 'Dataset':
print(f" {key}: {value:.4f}")
# Residual analysis
analyzer.analyze_residuals(model, X_test, y_test, f"{model_name} - Test Set")
# Add to tracker
tracker.add_result(model_name, model_type, train_metrics, test_metrics, model)
return train_metrics, test_metrics
# Metric Selection Rationale
def get_metric_rationale():
"""
Provide rationale for metric selection
"""
rationale = """
📊 METRIC SELECTION RATIONALE FOR SUPERKART SALES FORECASTING:
🎯 PRIMARY METRIC: RMSE (Root Mean Square Error)
• Penalizes large prediction errors more heavily
• Important for business: Large forecast errors = inventory problems
• Scale-interpretable: Same units as target (₹)
📈 SECONDARY METRICS:
MAE (Mean Absolute Error):
• Business-friendly interpretation: Average error in rupees
• Less sensitive to outliers than RMSE
• Easy stakeholder communication
R² (R-squared):
• Percentage of variance explained by model
• Model explanatory power indicator
• Benchmarking against simple baseline
Adjusted R²:
• Accounts for number of features
• Prevents overfitting through feature count penalty
• Better for model selection with different feature sets
MAPE (Mean Absolute Percentage Error):
• Percentage-based error measurement
• Scale-independent comparison
• Business KPI compatibility
🎯 BUSINESS IMPACT:
Lower RMSE = Better inventory planning = Reduced stockouts & overstock costs
"""
print(rationale)
get_metric_rationale()
📊 METRIC SELECTION RATIONALE FOR SUPERKART SALES FORECASTING:
🎯 PRIMARY METRIC: RMSE (Root Mean Square Error)
• Penalizes large prediction errors more heavily
• Important for business: Large forecast errors = inventory problems
• Scale-interpretable: Same units as target (₹)
📈 SECONDARY METRICS:
MAE (Mean Absolute Error):
• Business-friendly interpretation: Average error in rupees
• Less sensitive to outliers than RMSE
• Easy stakeholder communication
R² (R-squared):
• Percentage of variance explained by model
• Model explanatory power indicator
• Benchmarking against simple baseline
Adjusted R²:
• Accounts for number of features
• Prevents overfitting through feature count penalty
• Better for model selection with different feature sets
MAPE (Mean Absolute Percentage Error):
• Percentage-based error measurement
• Scale-independent comparison
• Business KPI compatibility
🎯 BUSINESS IMPACT:
Lower RMSE = Better inventory planning = Reduced stockouts & overstock costs
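The claim above that RMSE penalizes large errors more heavily than MAE is easy to verify on toy error vectors:

```python
# Sketch: RMSE vs MAE on two error profiles with the same total absolute
# error -- RMSE flags the profile containing one large miss, MAE cannot
# tell them apart.
import numpy as np

even_errors  = np.array([2.0, 2.0, 2.0, 2.0, 2.0])   # total |error| = 10
spiky_errors = np.array([0.0, 0.0, 0.0, 0.0, 10.0])  # total |error| = 10

def rmse(e): return float(np.sqrt(np.mean(e ** 2)))
def mae(e):  return float(np.mean(np.abs(e)))

print(mae(even_errors), mae(spiky_errors))    # 2.0 2.0   -> MAE identical
print(rmse(even_errors), rmse(spiky_errors))  # 2.0 ~4.47 -> RMSE punishes the spike
```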
Model Selection Notion: Linear Regression + CatBoost
Linear Regression:
CatBoost:
Why This Combination:
Result: Covers both ends of the ML spectrum - simple/interpretable vs complex/high-performance - ensuring robust model selection regardless of data complexity.
We could also consider RandomForest (left as an extension/homework enhancement).
NOTE:
💡 Linear models need scaling, encoding, transformations, and outlier handling, while RF/CatBoost are more robust to raw data and handle categoricals natively.
Common helpers for Modeling
analyzer = ModelPerformanceAnalyzer()
tracker = ModelResultTracker()
def get_feature_names_from_pipeline(model):
'''
**Model Feature Tracing depends on the algorithm:**
- CatBoost: `feature_names_`
- XGBoost: `feature_names`
- LightGBM: `feature_name_`
- sklearn Pipelines: through `get_feature_names_out()`
**No direct feature names:**
- Basic sklearn models (LinearRegression, RandomForest, etc.)
- Need to track features externally or through pipeline
- Only have coefficients/importance without column names
So feature tracing varies by library/implementation - not universal across all models!
'''
try:
# If it's a pipeline
if hasattr(model, 'named_steps'):
preprocessor = model.named_steps["preprocessing"]
return preprocessor.get_feature_names_out()
# If it's a CatBoost model
elif hasattr(model, 'feature_names_'):
return model.feature_names_
# If it's another type of model
else:
return None
except Exception as e:
print(f"Could not get feature names: {e}")
return None
# TODO: rename to pipeline
def quick_model_eval(pipeline, X_train, y_train, X_test, y_test, model_name="Model"):
"""
Quick sanity check for fitted pipeline
Args:
pipeline: Fitted sklearn pipeline
X_train: Train features
y_train: Train targets
X_test: Test features
y_test: Test targets
model_name: Name for logging
"""
print(f"\n🚀 QUICK TEST: {model_name}")
print("-" * 40)
# Feature names
feature_names = get_feature_names_from_pipeline(pipeline)
if feature_names is not None:
print(f"Feature names: {feature_names}")
# # Predict
# y_pred_test = pipeline.predict(X_test)
# y_pred_train = pipeline.predict(X_train)
# Calculate metrics using your existing method
metrics_test = ModelPerformanceAnalyzer.calculate_metrics(pipeline, X_test, y_test)
metrics_train = ModelPerformanceAnalyzer.calculate_metrics(pipeline, X_train, y_train)
# Quick summary
print("\nTrain Metrics:--- ")
print(f"RMSE: {metrics_train['RMSE']:.2f}")
print(f"R2: {metrics_train['R2']:.3f}")
print(f"MAE: {metrics_train['MAE']:.2f}")
print("\nTest Metrics:---")
print(f"RMSE: {metrics_test['RMSE']:.2f}")
print(f"R2: {metrics_test['R2']:.3f}")
print(f"MAE: {metrics_test['MAE']:.2f}")
# Usage after fitting:
# quick_model_eval(linear_model_pipeline, X_train, y_train, X_test, y_test, "Linear Regression")
# quick_model_eval(catboost_model_pipeline, X_train, y_train, X_test, y_test, "CatBoost")
def fit_model(model):
'''
Fit model managing the logs setup
'''
print('\n⚡ Model Training Started ---------\n')
DebugTransformer.ENABLE_DEBUG = True
model.fit(X_train, y_train)
DebugTransformer.ENABLE_DEBUG = False
print('\n⚡ Model Training Completed ---------')
df['Product_Store_Sales_Total'].describe()
count    8763.000000
mean     3464.003640
std      1065.630494
min        33.000000
25%      2761.715000
50%      3452.340000
75%      4145.165000
max      8000.000000
Name: Product_Store_Sales_Total, dtype: float64
✅ Good Model Performance Guide
With current target stats, a "decent" sales prediction model should ideally have
| Metric | Ballpark Range |
|---|---|
| RMSE | < 800 (decent), < 600 (good), < 450 (very good) |
| MAE | < 500 preferred |
| R² | > 0.7 is promising |
Baseline Comparison
Pro tip: Given the data's moderate variability (CV ≈ 31%, i.e. std/mean = 1065/3464), achieving 12-15% MAPE would be quite good for a sales prediction model!
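To make the "Baseline Comparison" concrete: a model that always predicts the training mean has RMSE roughly equal to the target's std (~1065 here), which is the floor any real model must beat. A sketch on synthetic, sales-like numbers (not the actual SuperKart data):

```python
# Sketch: a DummyRegressor predicting the training mean gives the floor any
# real model must beat -- its RMSE is roughly the target's standard deviation.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
y = rng.normal(3464, 1065, size=2000)  # synthetic target with df.describe()-like stats
X = rng.normal(size=(2000, 1))         # features are ignored by the dummy model

baseline = DummyRegressor(strategy='mean').fit(X, y)
rmse = float(np.sqrt(mean_squared_error(y, baseline.predict(X))))
print(f"Baseline RMSE: {rmse:.0f}")    # ~1065 -> real models should land well below
```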
m1_name = "Linear Model"
Pipeline order:
CustomDataPrepper → consolidate_categories → ColumnTransformer
NOTE:
📌 Use validate=False for FunctionTransformer when your function needs pandas DataFrame functionality. Most custom preprocessing functions need this setting.
# Step 1: Define the preprocessing pipeline
linear_preprocessing_pipeline = Pipeline([
# Step 1: Universal data cleaning & preparation
('data_prep', CustomDataPrepper()),
('debug1', DebugTransformer("After Data Cleaning")),
# Step 2: Category consolidation (Product_Type rare grouping)
('consolidate_rare', FunctionTransformer(
consolidate_categories,
# ?? Working with a pandas DataFrame (not a numpy array), so use validate=False
validate=False,
feature_names_out='one-to-one' # Fixes feature_names_out error
)),
('debug2', DebugTransformer("After Category Consolidation")),
# Step 3: Column-specific transformations
('column_transform', ColumnTransformer([
# One-Hot Encoding (Nominal Categorical)
('onehot', OneHotEncoder(
drop='first', # Avoid multicollinearity
handle_unknown='ignore' # Handle unseen categories
), ['Product_Sugar_Content', 'Product_Type', 'Store_Type']),
# Power Transformation (Right-skewed numerical)
('power_transform', PowerTransformer(
method='yeo-johnson',
standardize=True # Built-in scaling
), ['Product_Allocated_Area']),
# Ordinal Encoding (Ordinal Categorical)
('ordinal', OrdinalEncoder(
categories=[
['Small', 'Medium', 'High'], # Store_Size
['Tier 3', 'Tier 2', 'Tier 1'] # Store_Location_City_Type
]
), ['Store_Size', 'Store_Location_City_Type']),
# Standard Scaling (Normal numerical)
('standard_scale', StandardScaler(),
['Product_Weight', 'Product_MRP', 'Store_Age']),
# # MinMax Scaling (Discrete numerical - Store_Age)
# ('minmax_scale', MinMaxScaler(),
# ['Store_Age'])
], remainder='drop')), # Drop any unspecified columns
('debug3', DebugTransformer("After Preprocessing (ie Column Transformation)")),
])
# Step 4: Complete Linear Regression Pipeline
linear_model_pipeline = Pipeline([
('preprocessing', linear_preprocessing_pipeline),
('model', LinearRegression())
], verbose=True)
fit_model(linear_model_pipeline)
⚡ Model Training Started ---------

🪵 After Data Cleaning ---- (D)
Shape: (7010, 9)
Columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']

🪵 After Category Consolidation ---- (D)
Shape: (7010, 9)
Columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']

🪵 After Preprocessing (ie Column Transformation) ---- (D)
Data type: numpy array
[Pipeline] ..... (step 1 of 2) Processing preprocessing, total= 0.0s
[Pipeline] ............. (step 2 of 2) Processing model, total= 0.0s

⚡ Model Training Completed ---------
quick_model_eval(linear_model_pipeline, X_train, y_train, X_test, y_test, model_name="Linear Regression")
🚀 QUICK TEST: Linear Regression
----------------------------------------
Feature names: ['onehot__Product_Sugar_Content_No Sugar' 'onehot__Product_Sugar_Content_Regular' 'onehot__Product_Type_Canned' 'onehot__Product_Type_Dairy' 'onehot__Product_Type_Frozen Foods' 'onehot__Product_Type_Fruits and Vegetables' 'onehot__Product_Type_Health and Hygiene' 'onehot__Product_Type_Household' 'onehot__Product_Type_Meat' 'onehot__Product_Type_Others' 'onehot__Product_Type_Snack Foods' 'onehot__Product_Type_Soft Drinks' 'onehot__Store_Type_Food Mart' 'onehot__Store_Type_Supermarket Type1' 'onehot__Store_Type_Supermarket Type2' 'power_transform__Product_Allocated_Area' 'ordinal__Store_Size' 'ordinal__Store_Location_City_Type' 'standard_scale__Product_Weight' 'standard_scale__Product_MRP' 'standard_scale__Store_Age']

Train Metrics:---
RMSE: 449.16
R2: 0.822
MAE: 261.04

Test Metrics:---
RMSE: 435.23
R2: 0.834
MAE: 257.00
Key Highlights 🌟
Business Impact 💼
Overall, the model generalizes well
| Metric | Value | Interpretation |
|---|---|---|
| RMSE | 435.23 | ✅ Very strong - only ~12.6% of target mean, ~40% of std |
| MAE | 257.00 | ✅ Excellent - very low average absolute error |
| R² | 0.834 | ✅ High - ~83% of variance explained |
Let's track this model
evaluate_model_comprehensive(
model=linear_model_pipeline,
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
model_name="Linear Regression",
model_type="Regression",
tracker=tracker
)
📊 Linear Regression PERFORMANCE
==================================================
Training Metrics:
  RMSE: 449.1570
  MAE: 261.0367
  R2: 0.8221
  Adj_R2: 0.8218
  MAPE: 9.9978
  Samples: 7010.0000
Testing Metrics:
  RMSE: 435.2252
  MAE: 257.0005
  R2: 0.8340
  Adj_R2: 0.8329
  MAPE: 9.5983
  Samples: 1753.0000
({'Dataset': 'Train',
'RMSE': 449.15696331939733,
'MAE': 261.036727532097,
'R2': 0.8221034286834505,
'Adj_R2': 0.8218237970337675,
'MAPE': 9.997835356878639,
'Samples': 7010},
{'Dataset': 'Test',
'RMSE': 435.2251928580389,
'MAE': 257.0005305191101,
'R2': 0.8339897902644267,
'Adj_R2': 0.8329409032413989,
'MAPE': 9.598348304428047,
'Samples': 1753})
NOTE: The two code cells below are only for debugging — they trace issues that surfaced while using
evaluate_model_comprehensive, checking that every step of the model/pipeline behaves as expected.
Pipeline Check
# Check what happens at each pipeline step
print("Original X shape:", X_train.shape)
# Test each step of your pipeline
if hasattr(linear_model_pipeline, 'named_steps'):
for step_name, step in linear_model_pipeline.named_steps.items():
print(f"\nStep: {step_name}")
if hasattr(step, 'transform'):
try:
transformed = step.transform(X_train)
print(f"After {step_name}: {transformed.shape}")
except Exception as e:
print(f"Cannot transform at {step_name}: {e}")
Original X shape: (7010, 11)

Step: preprocessing
After preprocessing: (7010, 21)

Step: model
# Test the preprocessing steps individually
preprocessed_X = linear_model_pipeline[:-1].transform(X_train) # All steps except final model
print("Preprocessed X shape:", preprocessed_X.shape)
# Test final model prediction on preprocessed data
final_model = linear_model_pipeline[-1]
preds = final_model.predict(preprocessed_X)
print("Final predictions shape:", preds.shape)
Preprocessed X shape: (7010, 21)
Final predictions shape: (7010,)
For CatBoost, we need much simpler preprocessing.
CatBoost Preprocessing Requirements
✅ NEEDED:
❌ NOT NEEDED:
CatBoost Pipeline Structure
Pipeline([
('data_prep', CustomDataPrepper()), # Only universal cleaning
('model', CatBoostRegressor()) # That's it!
])
CatBoost Configuration We'll need to specify which columns are categorical:
Key advantage: CatBoost will automatically handle the categorical encoding internally and can capture non-linear patterns that Linear Regression missed.
This is much simpler to implement than the Linear Regression pipeline.
m2_name = "CatBoost Model"
# This is an alternative approach to discard the Product_ID and Store_ID columns
# NOTE: here the output will be a DataFrame
# ?? The class below is commented out as it's not currently used; if needed, place it in the `preprocessing` folder and import it here
# class ColumnDropper(BaseEstimator, TransformerMixin):
# def __init__(self, columns_to_drop):
# self.columns_to_drop = columns_to_drop
# def fit(self, X, y=None):
# return self
# def transform(self, X):
# existing_cols = [col for col in self.columns_to_drop if col in X.columns]
# return X.drop(columns=existing_cols)
# def get_feature_names_out(self, input_features=None):
# # Return input_features minus the dropped ones
# return [feat for feat in input_features if feat not in self.columns_to_drop]
# Now your pipeline becomes:
catboost_preprocessing_pipeline = Pipeline([
# Step 1: Universal data cleaning & preparation
('data_prep', CustomDataPrepper()),
('debug1', DebugTransformer("After Data Cleaning", show_head=True)),
# # Step 2: Drop irrelevant ID columns
# ('drop_ids', ColumnDropper(columns_to_drop=['Product_ID', 'Store_ID'])),
# ('debug2', DebugTransformer("After Dropping ID Columns", show_head=True)),
])
# CatBoost preprocessing pipeline (Alternative)
# NOTE: Here the output will be numpy
# ??Currently not using this as its an alternative to the one above
# catboost_preprocessing_pipeline_v2 = Pipeline([
# # Step 1: Universal data cleaning & preparation
# ('data_prep', CustomDataPrepper()),
# ('debug1', DebugTransformer("After Data Cleaning")),
# # Step 2: No transformations needed, just column organization
# ('column_setup', ColumnTransformer([
# # Pass-through for all features (no transformations)
# ('passthrough', 'passthrough', [
# # Numeric Features
# 'Product_Weight',
# 'Product_Allocated_Area',
# 'Product_MRP',
# 'Store_Age',
# # Categorical Features
# 'Product_Sugar_Content',
# 'Product_Type',
# 'Store_Size',
# 'Store_Location_City_Type',
# 'Store_Type'
# ])
# ], remainder='drop')), # Drop any unspecified columns
# ('debug2', DebugTransformer("After Column Setup"))
# ])
# Define categorical features for CatBoost
cat_features = [
'Product_Sugar_Content',
'Product_Type',
'Store_Size',
'Store_Location_City_Type',
'Store_Type',
]
# Complete CatBoost Pipeline
catboost_model_pipeline = Pipeline([
('preprocessing', catboost_preprocessing_pipeline),
('model', CatBoostRegressor(
cat_features=cat_features,
verbose=False # Suppress training output
))
])
catboost_model_pipeline
Pipeline(steps=[('preprocessing',
                 Pipeline(steps=[('data_prep', CustomDataPrepper()),
                                 ('debug1',
                                  DebugTransformer(message='After Data '
                                                           'Cleaning',
                                                   show_head=True))])),
                ('model',
                 <catboost.core.CatBoostRegressor object at 0x16b309350>)])
📌 NOTE:
The pipeline for CatBoost works because:
- CustomDataPrepper (the only active preprocessing step; ColumnDropper is commented out) keeps the pipeline output as a DataFrame
- CatBoostRegressor can therefore identify the categorical columns by name
fit_model(catboost_model_pipeline)
⚡ Model Training Started ---------
🪵 After Data Cleaning ---- (D)
Shape: (7010, 9)
Columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']
First 2 rows:
Product_Weight Product_Sugar_Content Product_Allocated_Area \
5967 13.64 No Sugar 0.071
8270 10.03 Low Sugar 0.016
Product_Type Product_MRP Store_Size Store_Location_City_Type \
5967 Household 157.10 Medium Tier 2
8270 Frozen Foods 126.66 High Tier 2
Store_Type Store_Age
5967 Supermarket Type2 16
8270 Supermarket Type1 38
⚡ Model Training Completed ---------
quick_model_eval(catboost_model_pipeline, X_train, y_train, X_test, y_test, model_name=m2_name)
🚀 QUICK TEST: CatBoost Model
----------------------------------------
Feature names: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']

Train Metrics:---
RMSE: 244.31
R2: 0.947
MAE: 89.62

Test Metrics:---
RMSE: 277.21
R2: 0.933
MAE: 103.77
Key Highlights 🌟
Business Impact 💼
This model is highly accurate, tight, and stable - likely near optimal.
| Metric | Value | Interpretation |
|---|---|---|
| RMSE | 277.21 | ✅ Outstanding — just ~8% of target mean, ~26% of std |
| MAE | 103.77 | ✅ Extremely low error — ~3% of mean, very tight predictions |
| R² | 0.933 | ✅ Excellent — explains 93.3% of target variance |
| Metric | Old Model | CatBoost | Improvement |
|---|---|---|---|
| Test RMSE | 435.23 | 277.21 | ⬇️ ~36% |
| Test MAE | 257.00 | 103.77 | ⬇️ ~60% |
| Test R² | 0.834 | 0.933 | ⬆️ ~0.10 absolute gain |
evaluate_model_comprehensive(
model=catboost_model_pipeline,
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
model_name=m2_name,
model_type="Tree",
tracker=tracker
)
📊 CatBoost Model PERFORMANCE
==================================================
Training Metrics:
  RMSE: 244.3106
  MAE: 89.6245
  R2: 0.9474
  Adj_R2: 0.9473
  MAPE: 3.4579
  Samples: 7010.0000
Testing Metrics:
  RMSE: 277.2118
  MAE: 103.7653
  R2: 0.9327
  Adj_R2: 0.9322
  MAPE: 3.7795
  Samples: 1753.0000
({'Dataset': 'Train',
'RMSE': 244.31064176551922,
'MAE': 89.62448631091097,
'R2': 0.9473672486757516,
'Adj_R2': 0.9472845164287429,
'MAPE': 3.457860281440354,
'Samples': 7010},
{'Dataset': 'Test',
'RMSE': 277.2117996479401,
'MAE': 103.76530600486473,
'R2': 0.9326511522361677,
'Adj_R2': 0.932225628212387,
'MAPE': 3.7794830900862766,
'Samples': 1753})
Model Performance Improvement
🚀 REMEMBER: Randomized vs GridSearch 🧠
Let's compare for our use case:
RandomizedSearchCV Advantages:
GridSearchCV Disadvantages:
Verdict: RandomizedSearchCV is better for our case:
m3_name = "CatBoost_Tuned"
For CatBoost regression, key hyperparameters to tune are:
Learning Parameters:
- learning_rate
- iterations (n_estimators)
- depth (max_depth)

Regularization:
- l2_leaf_reg
- min_data_in_leaf

Optimization:
- early_stopping_rounds
- eval_metric='RMSE' (our metric of choice)

Devised approach:
# Parameter distribution for random search
param_distributions = {
'model__iterations': randint(100, 1000),
'model__learning_rate': uniform(0.01, 0.3),
'model__depth': randint(4, 10),
'model__l2_leaf_reg': uniform(1, 10),
'model__min_data_in_leaf': randint(1, 30)
}
NOTE: ⚠️ Use prefix model__ because CatBoostRegressor is wrapped in the Pipeline under the 'model' step.
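If unsure of the exact prefix, the tunable names can be enumerated from the pipeline itself — sketched below with a Ridge stand-in for the CatBoost step:

```python
# Sketch: tunable names in a Pipeline are '<step_name>__<param>'; get_params()
# enumerates them, which is how the 'model__' prefix above is derived.
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scale', StandardScaler()), ('model', Ridge())])
param_names = [p for p in pipe.get_params() if p.startswith('model__')]
print(param_names)  # includes 'model__alpha', the name a search grid must use
```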
# RandomizedSearchCV setup
catboost_rs = RandomizedSearchCV(
estimator=catboost_model_pipeline,
param_distributions=param_distributions,
n_iter=20, # number of parameter settings sampled
cv=5, # 5-fold cross-validation
scoring='neg_root_mean_squared_error',
n_jobs=-1, # use all CPU cores
verbose=1,
random_state=42
)
# Fit CatBoost - RandomizedSearchCV
fit_model(catboost_rs)
⚡ Model Training Started ---------
Fitting 5 folds for each of 20 candidates, totalling 100 fits
🪵 After Data Cleaning ---- (D)
Shape: (7010, 9)
Columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']
First 2 rows:
Product_Weight Product_Sugar_Content Product_Allocated_Area \
5967 13.64 No Sugar 0.071
8270 10.03 Low Sugar 0.016
Product_Type Product_MRP Store_Size Store_Location_City_Type \
5967 Household 157.10 Medium Tier 2
8270 Frozen Foods 126.66 High Tier 2
Store_Type Store_Age
5967 Supermarket Type2 16
8270 Supermarket Type1 38
⚡ Model Training Completed ---------
# Get best results
print("\nBest parameters:", catboost_rs.best_params_)
print("Best RMSE:", -catboost_rs.best_score_)  # scoring already returns negative RMSE; just flip the sign (no extra sqrt)
Best parameters: {'model__depth': 7, 'model__iterations': 513, 'model__l2_leaf_reg': 3.1233911067827616, 'model__learning_rate': 0.06454749016213018, 'model__min_data_in_leaf': 21}
Best RMSE: 281.7159
🔍 Why a minus (-) appears? sklearn's search/CV utilities always maximize a score, so error metrics are exposed as negatives — flip the sign to recover the actual error:
actual_rmse = -search.best_score_
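A small synthetic check of the sign convention (sklearn maximizes scores, so error scorers are negative by design):

```python
# Sketch: 'neg_root_mean_squared_error' scores are negative by design,
# so negating the mean (or best_score_) recovers the actual RMSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.5, size=200)  # noise std 0.5

scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_root_mean_squared_error')
print(bool((scores <= 0).all()))      # True: every fold's score is <= 0
print(f"RMSE: {-scores.mean():.2f}")  # negate to get the real error (~noise std)
```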
# For final predictions, use:
catboost_rs_tuned = catboost_rs.best_estimator_
Difference between best_estimator_ and original search instance
Both serve different purposes:
random_search.best_estimator_:
random_search:
# For final predictions, use:
best_model = random_search.best_estimator_
predictions = best_model.predict(X_test)
TIPS: ✅ Always use `best_estimator_` for your final model!
Using `best_estimator_` is preferred because:
quick_model_eval(catboost_rs_tuned, X_train, y_train, X_test, y_test, model_name=m3_name)
🚀 QUICK TEST: CatBoost_Tuned
----------------------------------------
Feature names: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']

Train Metrics:---
RMSE: 251.03
R2: 0.944
MAE: 92.85

Test Metrics:---
RMSE: 277.09
R2: 0.933
MAE: 105.17
🧠 Insight:
Tuning didn’t move the needle much, which is often the case when the base model is already near-optimal.
- Minor fluctuations in MAE/RMSE are likely within noise bounds.
Key Highlights 🌟
Shows slightly better stability!
| Metric | Value | Interpretation |
|---|---|---|
| RMSE | 277.09 | ✅ Outstanding - ~8% of target mean, ~26% of std |
| MAE | 105.17 | ✅ Still excellent - low average error (~3% of mean) |
| R² | 0.933 | ✅ Excellent - explains 93.3% of target variance (same as before) |
evaluate_model_comprehensive(
model=catboost_rs_tuned,
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
model_name=m3_name,
model_type="Tree",
tracker=tracker
)
📊 CatBoost_Tuned PERFORMANCE
==================================================
Training Metrics:
  RMSE: 251.0255
  MAE: 92.8548
  R2: 0.9444
  Adj_R2: 0.9443
  MAPE: 3.5841
  Samples: 7010.0000
Testing Metrics:
  RMSE: 277.0911
  MAE: 105.1661
  R2: 0.9327
  Adj_R2: 0.9323
  MAPE: 3.8729
  Samples: 1753.0000
({'Dataset': 'Train',
'RMSE': 251.0255178978363,
'MAE': 92.85475093875208,
'R2': 0.9444342670493849,
'Adj_R2': 0.9443469245140239,
'MAPE': 3.584149036976944,
'Samples': 7010},
{'Dataset': 'Test',
'RMSE': 277.09109978117425,
'MAE': 105.16610850102744,
'R2': 0.9327097877434043,
'Adj_R2': 0.9322846341909502,
'MAPE': 3.8728926827812695,
'Samples': 1753})
Key Parameters to Tune:
Regularization Type:
Regularization Strength:
Ridge is better for our case because:
m4_name = 'Linear Model Tuned'
# Update pipeline with Ridge
ridge_model_pipeline = Pipeline([
('preprocessing', linear_preprocessing_pipeline),
('model', Ridge())
])
# Grid of alpha values
param_grid = {
'model__alpha': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
# 'model__alpha': np.logspace(-3, 3, 20)
}
# GridSearchCV setup
grid_search_ridge = GridSearchCV(
estimator=ridge_model_pipeline,
param_grid=param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1,
verbose=1
)
fit_model(grid_search_ridge)
⚡ Model Training Started ---------
Fitting 5 folds for each of 7 candidates, totalling 35 fits

🪵 After Data Cleaning ---- (D)
Shape: (7010, 9)
Columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']

🪵 After Category Consolidation ---- (D)
Shape: (7010, 9)
Columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']

🪵 After Preprocessing (ie Column Transformation) ---- (D)
Data type: numpy array

⚡ Model Training Completed ---------
# Step 5: Evaluate
print("Best Parameters:", grid_search_ridge.best_params_)
print("Best Score (neg RMSE):", grid_search_ridge.best_score_)
Best Parameters: {'model__alpha': 10.0}
Best Score (neg RMSE): -450.2733562560611
# For final predictions, use:
linear_model_tuned = grid_search_ridge.best_estimator_
quick_model_eval(linear_model_tuned, X_train, y_train, X_test, y_test, model_name=m4_name)
🚀 QUICK TEST: Linear Model Tuned
----------------------------------------
Feature names: ['onehot__Product_Sugar_Content_No Sugar' 'onehot__Product_Sugar_Content_Regular' 'onehot__Product_Type_Canned' 'onehot__Product_Type_Dairy' 'onehot__Product_Type_Frozen Foods' 'onehot__Product_Type_Fruits and Vegetables' 'onehot__Product_Type_Health and Hygiene' 'onehot__Product_Type_Household' 'onehot__Product_Type_Meat' 'onehot__Product_Type_Others' 'onehot__Product_Type_Snack Foods' 'onehot__Product_Type_Soft Drinks' 'onehot__Store_Type_Food Mart' 'onehot__Store_Type_Supermarket Type1' 'onehot__Store_Type_Supermarket Type2' 'power_transform__Product_Allocated_Area' 'ordinal__Store_Size' 'ordinal__Store_Location_City_Type' 'standard_scale__Product_Weight' 'standard_scale__Product_MRP' 'standard_scale__Store_Age']

Train Metrics:---
RMSE: 449.14
R2: 0.822
MAE: 259.54

Test Metrics:---
RMSE: 435.14
R2: 0.834
MAE: 255.47
Key Highlights 🌟
Consider using CatBoost for highest accuracy, or this linear model when simplicity/speed is prioritized over maximum precision.
evaluate_model_comprehensive(
model=linear_model_tuned,
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
model_name=m4_name,
model_type="Regression",
tracker=tracker
)
📊 Linear Model Tuned PERFORMANCE
==================================================
Training Metrics:
RMSE: 449.1358  MAE: 259.5390  R2: 0.8221  Adj_R2: 0.8218  MAPE: 9.9623  Samples: 7010.0000
Testing Metrics:
RMSE: 435.1363  MAE: 255.4742  R2: 0.8341  Adj_R2: 0.8330  MAPE: 9.5640  Samples: 1753.0000
({'Dataset': 'Train',
'RMSE': 449.1358366713886,
'MAE': 259.53897853870836,
'R2': 0.8221201634561986,
'Adj_R2': 0.8218405581115313,
'MAPE': 9.962270418096642,
'Samples': 7010},
{'Dataset': 'Test',
'RMSE': 435.13632276038066,
'MAE': 255.47423065159316,
'R2': 0.8340575797101025,
'Adj_R2': 0.8330091209948878,
'MAPE': 9.564044966201452,
'Samples': 1753})
# Get Summary
tracker.get_results_summary_vertically()
| Model_Name | Linear Regression | CatBoost Model | CatBoost_Tuned | Linear Model Tuned |
|---|---|---|---|---|
| Train_R2 | 0.822103 | 0.947367 | 0.944434 | 0.822120 |
| Test_R2 | 0.833990 | 0.932651 | 0.932710 | 0.834058 |
| Train_Adj_R2 | 0.821824 | 0.947285 | 0.944347 | 0.821841 |
| Test_Adj_R2 | 0.832941 | 0.932226 | 0.932285 | 0.833009 |
| Train_RMSE | 449.156963 | 244.310642 | 251.025518 | 449.135837 |
| Test_RMSE | 435.225193 | 277.211800 | 277.091100 | 435.136323 |
| Train_MAE | 261.036728 | 89.624486 | 92.854751 | 259.538979 |
| Test_MAE | 257.000531 | 103.765306 | 105.166109 | 255.474231 |
| Train_MAPE | 9.997835 | 3.457860 | 3.584149 | 9.962270 |
| Test_MAPE | 9.598348 | 3.779483 | 3.872893 | 9.564045 |
| Overfitting_Score | -0.011886 | 0.014716 | 0.011724 | -0.011937 |
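The Overfitting_Score row isn't defined anywhere above; from the numbers it appears to be Train_R2 minus Test_R2 (negative values mean the model scored slightly higher on the test split). A quick sanity check under that assumption:

```python
# Assumption: Overfitting_Score = Train_R2 - Test_R2 (inferred from the table)
train_r2, test_r2 = 0.947367, 0.932651   # CatBoost Model column
overfitting_score = round(train_r2 - test_r2, 6)
print(overfitting_score)  # 0.014716, matching the table
```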
⚡ Insights
# Plot Comparison
tracker.plot_model_comparison()
# Pick Best Model
best_model_name, best_score = tracker.get_best_model(metric='Test_R2', higher_better=True)
🏆 BEST MODEL: CatBoost_Tuned 📊 Test_R2: 0.9327
🚀 Thus the CatBoost model is selected
best_model = catboost_rs_tuned
Making the Model Ready for Production
type(best_model)
sklearn.pipeline.Pipeline
best_model.named_steps
{'preprocessing': Pipeline(steps=[('data_prep', CustomDataPrepper()),
('debug1',
DebugTransformer(message='After Data Cleaning',
show_head=True))]),
'model': <catboost.core.CatBoostRegressor at 0x16af3e690>}
preprocessing_pipeline = best_model.named_steps['preprocessing']
debug_transformer = preprocessing_pipeline.named_steps['debug1']
type(debug_transformer)
preprocessing.debug_transformer.DebugTransformer
# * For Production debugging is not required
debug_transformer.turn_off()
print('Is Debug Step Active?', best_model.named_steps['preprocessing']['debug1'].active)
Is Debug Step Active? False
class ModelSerializer:
def __init__(self, base_path='models'):
self.base_path = Path(base_path)
self.base_path.mkdir(exist_ok=True)
def save_model(self, model, model_name, include_timestamp=False):
"""Save model with optional timestamp"""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") if include_timestamp else ""
filename = f"{model_name}_{timestamp}.joblib" if include_timestamp else f"{model_name}.joblib"
path = self.base_path / filename
joblib.dump(model, path)
print(f"✅ Model saved: {path}")
return path
def load_model(self, model_path):
"""Load model from path"""
path = Path(model_path)
if not path.exists():
raise FileNotFoundError(f"Model not found: {path}")
model = joblib.load(path)
print(f"✅ Model loaded: {path}")
return model
serializer = ModelSerializer()
# Get pipeline steps without debug
# For Debugging Purpose only (Ignore)
clean_steps = [(name, step) for name, step in best_model.named_steps.items()
if not isinstance(step, DebugTransformer)]
# Now save
serializer.save_model(best_model, "best_model", include_timestamp=False)
✅ Model saved: models/best_model.joblib
PosixPath('models/best_model.joblib')
def test_saved_model(model_path):
"""
Test saved model by loading and verifying predictions
Args:
model_path: Path to saved model
X_test: Test features
y_test: Test targets
"""
print("\n🔄 Testing Saved Model")
print("-" * 40)
# Load model
loaded_model = serializer.load_model(model_path)
# Check cat names
model = loaded_model.named_steps['model']
cat_features = model.get_params()['cat_features']
print("Cat Features: ", cat_features)
# Verify predictions
print("\n🔍 Performance Verification:")
metrics = quick_model_eval(
pipeline=loaded_model,
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
model_name="Loaded Model"
)
return loaded_model, metrics
loaded_model, verification_metrics = test_saved_model("models/best_model.joblib")
🔄 Testing Saved Model
----------------------------------------
✅ Model loaded: models/best_model.joblib
Cat Features: ['Product_Sugar_Content', 'Product_Type', 'Store_Size', 'Store_Location_City_Type', 'Store_Type']
🔍 Performance Verification:
🚀 QUICK TEST: Loaded Model ----------------------------------------
Feature names: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']
Train Metrics:---
RMSE: 251.03  R2: 0.944  MAE: 92.85
Test Metrics:---
RMSE: 277.09  R2: 0.933  MAE: 105.17
loaded_model
Pipeline(steps=[('preprocessing',
Pipeline(steps=[('data_prep', CustomDataPrepper()),
('debug1',
DebugTransformer(active=False,
message='After Data '
'Cleaning',
show_head=True))])),
('model',
<catboost.core.CatBoostRegressor object at 0x16b399bd0>)])
# Custom Data Checker
def test_inference(model):
# Test data
    # NOTE:
    # - We don't pass Store_Age (only Store_Establishment_Year), and the column
    #   order differs from the training data,
    # - but the pipeline's CustomDataPrepper handles this: it unifies the column
    #   order (and derives Store_Age).
query_data = {
'Product_Weight': 100,
'Product_MRP': 50,
'Product_Allocated_Area': 200,
#'Store_Age': 5,
'Store_Establishment_Year': 2020,
'Product_Sugar_Content': 'Regular',
'Product_Type': 'Dairy',
'Store_Size': 'Medium',
'Store_Location_City_Type': 'Tier 1',
'Store_Type': 'Supermarket Type1'
}
dft = pd.DataFrame([query_data])
return model.predict(dft)
test_inference(loaded_model)
array([4192.07139122])
⚡ Woohoo.. It worked !!
Validated ✅
NOTE:
Since we trained our CatBoost model directly on the raw sales values (Product_Store_Sales_Total), we don't need any inverse transformation like np.exp().
Unlike workflows where the target is log-transformed before training, our model predicts on the original sales scale, so we can use its predictions directly in our API responses.
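For contrast, here is a minimal sketch (with hypothetical numbers) of the inverse step that would be required had the target been log-transformed before training:

```python
import numpy as np

# Hypothetical log-scale workflow: train on log1p(sales), invert with expm1
y_log_pred = np.log1p(np.array([4192.07, 2876.46]))  # pretend these came from a model
y_rupees = np.expm1(y_log_pred)                      # back to the original sales scale
print(y_rupees)  # recovers the original values
```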
When you serialize an object (like an ML model) with joblib or pickle, it records the fully qualified name of any custom classes used, i.e., <module_path>.<class_name>
If this class is not found in the same module path during deserialization, you get errors like:
AttributeError: Can't get attribute 'MyCustomClass' on <module '__main__'>
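A tiny repro of this mechanism; `MyCustomClass` here is illustrative, not part of the project:

```python
import pickle

class MyCustomClass:          # stand-in for a custom transformer
    pass

payload = pickle.dumps(MyCustomClass())
# The byte stream embeds the qualified name (e.g. "__main__.MyCustomClass");
# if that attribute can't be resolved at load time, unpickling raises the
# AttributeError shown above.
print(b"MyCustomClass" in payload)  # True
```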
def print_class_location(cls):
# From object we can get the class via __class__
print(f"{cls.__module__}.{cls.__qualname__}")
print_class_location(DebugTransformer)
preprocessing.debug_transformer.DebugTransformer
print_class_location(CustomDataPrepper)
preprocessing.custom_data_prepper.CustomDataPrepper
🔐 NOTE on Access
Both Spaces should be Public or both should be Private, otherwise you’ll get CORS or auth errors.
If Private, make sure you're logged in when testing frontend (or use Hugging Face tokens).
Setup
Access Token (We'll need this later):
NOTE: This is needed only when using the huggingface_hub Python package to upload files and folders from local to Spaces. If you use GitHub or manual upload, it's not needed.
TIPS (Cells | Notebooks)
High Level Backend Folder Structure Idea
backend_files/
├── app/
│ ├── __init__.py
│ ├── main.py
│ ├── routes.py
│ ├── model_loader.py
│ ├── predict.py
├── models/
│ └── best_model.joblib # (you'll copy this manually)
├── preprocessing/
│ ├── custom_data_prepper.py
│ ├── debug_transformer.py
│ ├── __init__.py
├── requirements.txt
├── Dockerfile
└── README.md # (optional)
Create Folder (Only Once)
# Create main backend directory
os.makedirs("backend", exist_ok=True)
# Create app and models subdirectories
os.makedirs("backend/app", exist_ok=True)
os.makedirs("backend/models", exist_ok=True)
# Helpers
# Create preprocessing folder
os.makedirs("backend/preprocessing", exist_ok=True)
# Copy all files from preprocessing to backend/preprocessing
!cp preprocessing/* backend/preprocessing/
cp: preprocessing/__pycache__ is a directory (not copied).
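The warning above is harmless: plain `cp` skips directories such as `__pycache__`, and bytecode caches shouldn't ship anyway. To be explicit about it, you could copy only the `.py` sources (a sketch, assuming the same folder layout):

```shell
# Copy only Python sources; __pycache__ is never touched
mkdir -p backend/preprocessing
cp preprocessing/*.py backend/preprocessing/
```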
# Copy model to backend
!cp models/best_model.joblib backend/models/
NOTE: The function-based loader below is kept commented out for reference; the class-based ModelLoader that follows replaces it.
# %%writefile backend/app/model_loader.py
# import joblib
# import os
# from typing import Optional
# # Singleton pattern for model
# _model = None
# def load_model(path: str = "models/best_model.joblib") -> Optional[object]:
# """
# Load model using singleton pattern to avoid multiple loads
# Args:
# path: Path to saved model file
# Returns:
# Loaded model object
# Raises:
# FileNotFoundError: If model file doesn't exist
# """
# global _model
# if _model is None:
# base_dir = os.path.dirname(__file__)
# model_path_rel = os.path.join(base_dir, '..', path) # path/to/backend
# # ! note this is optional step model_path_rel also works
# # this just converts to absolute path (thats it)
# model_path = os.path.abspath(model_path_rel)
# if not os.path.exists(model_path):
# raise FileNotFoundError(f"Model not found at {model_path}")
# try:
# _model = joblib.load(model_path)
# print(f"✅ Model loaded successfully from {model_path}")
# except Exception as e:
# raise Exception(f"Error loading model: {str(e)}")
# return _model
%%writefile backend/app/model_loader.py
#import __main__
import joblib
import os
from typing import Optional
from preprocessing import CustomDataPrepper, DebugTransformer
#from ..preprocessing import CustomDataPrepper, DebugTransformer
# Using relative imports is better practice when dealing with modules within the same package
# ! The hack below was used earlier, before the preprocessing helper classes were extracted into a separate package !
# ---
# Register it in __main__ so joblib/pickle can resolve it
#__main__.CustomDataPrepper = CustomDataPrepper
#__main__.DebugTransformer = DebugTransformer
# ---
# Class based approach to load model (setup)
class ModelLoader:
"""Singleton class for loading and managing ML model"""
_instance = None
_model = None
def __new__(cls):
if cls._instance is None:
cls._instance = super(ModelLoader, cls).__new__(cls)
return cls._instance
def get_model(self, path: str = "models/best_model.joblib") -> Optional[object]:
"""Get model, loading it if not already loaded"""
if self._model is None:
print('Model not loaded yet, loading ...')
base_dir = os.path.dirname(__file__)
model_path_rel = os.path.join(base_dir, '..', path) # path/to/backend
            # NOTE: this step is optional; model_path_rel also works.
            # abspath just normalizes it to an absolute path.
model_path = os.path.abspath(model_path_rel)
if not os.path.exists(model_path):
raise FileNotFoundError(f"Model not found at {model_path}")
try:
self._model = joblib.load(model_path)
print(f"✅ Model loaded successfully from {model_path}")
except Exception as e:
raise Exception(f"Error loading model: {str(e)}")
return self._model
Overwriting backend/app/model_loader.py
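The `__new__`-based singleton above can be verified in isolation; a minimal sketch (the class name is illustrative):

```python
# Minimal sketch of the __new__-based singleton pattern used by ModelLoader
class SingletonDemo:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

a, b = SingletonDemo(), SingletonDemo()
print(a is b)  # True: every construction returns the same instance
```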
%%writefile backend/app/predict.py
import pandas as pd
from typing import Dict, List, Union
from .model_loader import ModelLoader
# Pure Prediction Logic
class SalesPredictor:
"""Handle sales prediction logic and data processing"""
def __init__(self):
self.model = ModelLoader().get_model()
def predict_single(self, data: Dict) -> Dict:
"""
Predict sales for single store-product
Args:
data: Dictionary with feature values
Returns:
Dictionary with prediction and metadata
"""
try:
# Convert single input to DataFrame
df = pd.DataFrame([data])
# Make prediction
prediction = self.model.predict(df)[0]
return {
"status": "success",
"predicted_sales": float(prediction),
"message": "Prediction successful"
}
except Exception as e:
return {
"status": "error",
"message": f"Prediction failed: {str(e)}"
}
def predict_batch(self, df: pd.DataFrame) -> Dict:
"""
Predict sales for multiple store-products
Args:
df: DataFrame with feature values
Returns:
Dictionary with predictions and metadata
"""
try:
# Make predictions
predictions = self.model.predict(df)
return {
"status": "success",
"predictions": predictions.tolist(),
"message": f"Successfully predicted {len(predictions)} samples"
}
except Exception as e:
return {
"status": "error",
"message": f"Batch prediction failed: {str(e)}"
}
Writing backend/app/predict.py
%%writefile backend/app/routes.py
from flask import Blueprint, request, jsonify
from .predict import SalesPredictor
import pandas as pd
# Create Blueprint for prediction routes
router = Blueprint('predictions', __name__)
# Required features for prediction
REQUIRED_FEATURES = [
'Product_Weight', 'Product_MRP', 'Product_Allocated_Area',
'Store_Establishment_Year', 'Product_Sugar_Content', 'Product_Type',
'Store_Size', 'Store_Location_City_Type', 'Store_Type'
]
# Initialize predictor
predictor = SalesPredictor()
@router.get('/')
def home():
return "Welcome to SuperKart Sales Prediction API!"
@router.post('/predict')
def predict_sales():
"""Single prediction endpoint"""
try:
# Get JSON data
data = request.get_json()
# Validate required features
missing_features = [feat for feat in REQUIRED_FEATURES if feat not in data]
if missing_features:
return jsonify({
"status": "error",
"message": f"Missing required features: {missing_features}"
}), 400
# Create DataFrame with required features
input_data = {feature: data[feature] for feature in REQUIRED_FEATURES}
# Get prediction
result = predictor.predict_single(input_data)
return jsonify(result), 200 if result["status"] == "success" else 400
except Exception as e:
return jsonify({
"status": "error",
"message": f"Prediction failed: {str(e)}"
}), 500
@router.post('/predict_batch')
def predict_sales_batch():
"""Batch prediction endpoint"""
try:
# Check if file was uploaded
if 'file' not in request.files:
return jsonify({
"status": "error",
"message": "No file uploaded"
}), 400
file = request.files['file']
# Read CSV file
df = pd.read_csv(file)
# Validate required features
missing_features = [feat for feat in REQUIRED_FEATURES if feat not in df.columns]
if missing_features:
return jsonify({
"status": "error",
"message": f"Missing required features in CSV: {missing_features}"
}), 400
# Select only required features
input_df = df[REQUIRED_FEATURES]
# Get predictions
result = predictor.predict_batch(input_df)
return jsonify(result), 200 if result["status"] == "success" else 400
except Exception as e:
return jsonify({
"status": "error",
"message": f"Batch prediction failed: {str(e)}"
}), 500
Overwriting backend/app/routes.py
%%writefile backend/app/main.py
from flask import Flask
from .routes import router # Import Blueprint directly
def create_app():
"""Create and configure Flask application"""
# NOTE: Flask(__name__) is the recommended way because
# 1. Package Resolution
# 2. Custom string might break package resolution
app = Flask(__name__) # Instead of Flask("SuperKart Sales Predictor")
# Configure app settings
app.config.update(
JSON_SORT_KEYS=False, # Preserve JSON response order
MAX_CONTENT_LENGTH=16 * 1024 * 1024 # 16MB max file size for batch predictions
)
# Register routes blueprint
app.register_blueprint(router)
return app
# Create app instance
app = create_app()
if __name__ == '__main__':
# For local development
# app.run(debug=True)
# For HuggingFace deployment
app.run(host="0.0.0.0", port=7860, debug=True)
Writing backend/app/main.py
This:
Result:
%%writefile backend/app/__init__.py
# Future exports when needed:
# from .main import app
# from .routes import router
# from .predict import SalesPredictor
# from .model_loader import ModelLoader
# __all__ = ['app', 'router', 'SalesPredictor', 'ModelLoader']
Writing backend/app/__init__.py
Requirements (ie Dependencies Listing)
# let's check our current environment
!pip freeze > current_requirements.txt
# !cat current_requirements.txt
!pip list | grep -i "pandas\|numpy\|scikit\|catboost\|joblib\|flask\|gunicorn\|requests"
catboost 1.2.8
Flask 2.2.2
gunicorn 23.0.0
joblib 1.2.0
numpy 1.24.3
numpydoc 1.5.0
pandas 2.3.0
requests 2.31.0
requests-file 1.5.1
requests-toolbelt 1.0.0
scikit-image 0.20.0
scikit-learn 1.2.2
import sklearn as sk
import catboost
import flask
import gunicorn
import requests
# Let see current versions
print("Current versions in use:")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"scikit-learn: {sk.__version__}")
print(f"catboost: {catboost.__version__}")
print(f"joblib: {joblib.__version__}")
print(f"flask: {flask.__version__}")
print(f"gunicorn: {gunicorn.__version__}")
print(f"requests: {requests.__version__}")
Current versions in use:
pandas: 2.3.0
numpy: 1.24.3
scikit-learn: 1.2.2
catboost: 1.2.8
joblib: 1.2.0
flask: 2.2.2
gunicorn: 23.0.0
requests: 2.31.0
%%writefile backend/requirements.txt
# ML & Data Processing
pandas==2.3.0
numpy==1.24.3
scikit-learn==1.2.2
catboost==1.2.8
joblib==1.2.0
# API & Server
# flask==2.2.2
flask>=2.0,<2.3
werkzeug<3.0.0  # Flask 2.2.x is incompatible with Werkzeug 3.x (removed internal APIs)
gunicorn==23.0.0
# Utilities
requests==2.31.0
Writing backend/requirements.txt
%%writefile backend/Dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ app/
COPY models/ models/
COPY preprocessing/ preprocessing/
# Set environment variable for port
ENV PORT=7860
EXPOSE 7860
CMD ["gunicorn", "--workers", "1", "--timeout", "180", "--bind", "0.0.0.0:7860", "app.main:app"]
Overwriting backend/Dockerfile
Try to find a way to set the PORT number dynamically, i.e., not hard-code 7860 everywhere.
Gist
| Section | What's Good |
|---|---|
| `python:3.10-slim` | ✅ Lightweight base image |
| `ENV PORT=7860` | ✅ Flexible port declaration |
| `EXPOSE ${PORT}` | ✅ Dynamically exposes the correct port |
| gunicorn usage | ✅ Better for production-like demo |
| `--workers 1` | ✅ Hugging Face Spaces = low-concurrency; one worker is fine |
| `--timeout 180` | ✅ Helpful if your model loads slowly |
| `"app.main:app"` | ✅ Correct WSGI path (i.e., `app = Flask(...)` in `app/main.py`) |
🔍 Just One Check
Be sure app/main.py exposes a module-level app object. In our case it does:
app = create_app()
If the Flask instance lived only inside create_app() with no module-level assignment, Gunicorn would not resolve app.main:app unless you adjusted the WSGI path.
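As an aside, Gunicorn can also call an application factory directly; a hedged alternative CMD (same flags as our Dockerfile) if the module-level app assignment were ever removed:

```shell
# Let Gunicorn invoke the factory itself instead of importing a module-level app
gunicorn --workers 1 --timeout 180 --bind 0.0.0.0:7860 "app.main:create_app()"
```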
Alternative
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ app/
COPY models/ models/
ENV PORT=8000
# Add memory limit for ML model
ENV MAX_WORKERS=1
EXPOSE ${PORT}
CMD ["sh", "-c", "gunicorn --timeout 180 -w ${MAX_WORKERS} -b 0.0.0.0:${PORT} app.main:app"]
# HF Config
# NOTE: never hard-code tokens in a notebook; read the write token from an environment variable
access_token = os.environ["HF_TOKEN"]
repo_id = "nipunshah/superkart-backend"
# Login to Hugging Face
login(token=access_token)
# Initialize API
api = HfApi()
# Upload backend folder
api.upload_folder(
folder_path="backend", # Our local backend folder
repo_id=repo_id, # Your space name
repo_type="space" # Type is space
)
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/nipunshah/.cache/huggingface/token
Login successful
'https://huggingface.co/spaces/nipunshah/superkart-backend/tree/main/'
NOTE:
Successful deployment logs would look something like this in container tab
===== Application Startup at 2025-07-04 17:55:31 =====
[2025-07-04 17:55:43 +0000] [1] [INFO] Starting gunicorn 23.0.0
[2025-07-04 17:55:43 +0000] [1] [INFO] Listening at: http://0.0.0.0:7860 (1)
[2025-07-04 17:55:43 +0000] [1] [INFO] Using worker: sync
[2025-07-04 17:55:43 +0000] [7] [INFO] Booting worker with pid: 7
def test_api(data: dict, endpoint: str = 'predict') -> None:
"""
Test SuperKart API endpoints
Args:
data: Dictionary of features for prediction
endpoint: 'predict' or 'predict_batch'
base_url: HF space URL (default: your space URL)
"""
base_url = "https://nipunshah-superkart-backend.hf.space"
url = f"{base_url}/{endpoint}"
try:
# Make API call
response = requests.post(url, json=data)
print(f"Status Code: {response.status_code}")
print("\nResponse:")
print(json.dumps(response.json(), indent=2))
except Exception as e:
print(f"Error: {str(e)}")
test_data = {
'Product_Weight': 100,
'Product_MRP': 50,
'Product_Allocated_Area': 200,
'Store_Establishment_Year': 2020,
'Product_Sugar_Content': 'Regular',
'Product_Type': 'Dairy',
'Store_Size': 'Medium',
'Store_Location_City_Type': 'Tier 1',
'Store_Type': 'Supermarket Type1'
}
test_api(test_data)
Status Code: 200
Response:
{
"status": "success",
"predicted_sales": 4192.071391218717,
"message": "Prediction successful"
}
✨ Hurray, Verified ✅
Structure
frontend/
├── app.py # Streamlit application
├── requirements.txt # Dependencies
└── Dockerfile # Container configuration
Let's first see the unique values for each categorical feature
def get_list_of_unique_values(df, column_name):
"""
Get list of unique values for a given column in a DataFrame
Args:
df: DataFrame containing the column
column_name: Name of the column to get unique values from
Returns:
List of unique values
"""
return df[column_name].unique().tolist()
product_types = get_list_of_unique_values(df, 'Product_Type')
product_types
['Frozen Foods', 'Dairy', 'Canned', 'Baking Goods', 'Health and Hygiene', 'Snack Foods', 'Meat', 'Household', 'Hard Drinks', 'Fruits and Vegetables', 'Breads', 'Soft Drinks', 'Breakfast', 'Others', 'Starchy Foods', 'Seafood']
product_sugar_contents = get_list_of_unique_values(df, 'Product_Sugar_Content')
# Drop the raw 'reg' label: it is consolidated into 'Regular' during preprocessing,
# so the UI shouldn't offer it as a separate option
product_sugar_contents.remove('reg')
product_sugar_contents
['Low Sugar', 'Regular', 'No Sugar']
store_sizes = get_list_of_unique_values(df, 'Store_Size')
store_sizes
['Medium', 'High', 'Small']
store_location_city_types = get_list_of_unique_values(df, 'Store_Location_City_Type')
store_location_city_types
['Tier 2', 'Tier 1', 'Tier 3']
store_types = get_list_of_unique_values(df, 'Store_Type')
store_types
['Supermarket Type2', 'Departmental Store', 'Supermarket Type1', 'Food Mart']
Create the frontend root directory
# Create frontend directory
os.makedirs("frontend", exist_ok=True)
app.py (Streamlit entry point)
%%writefile frontend/app.py
import streamlit as st
import requests
import pandas as pd
import datetime
import io
# Base URL for the backend API
BASE_URL = "https://nipunshah-superkart-backend.hf.space"
# Set page title
st.title("SuperKart Sales Prediction")
# Create tabs for single prediction and batch prediction
tab1, tab2 = st.tabs(["Single Prediction", "Batch Prediction"])
product_types = [
"Frozen Foods",
"Dairy",
"Canned",
"Baking Goods",
"Health and Hygiene",
"Snack Foods",
"Meat",
"Household",
"Hard Drinks",
"Fruits and Vegetables",
"Breads",
"Soft Drinks",
"Breakfast",
"Others",
"Starchy Foods",
"Seafood",
]
product_sugar_contents = ["Low Sugar", "Regular", "No Sugar"]
store_sizes = ["Small", "Medium", "High"]
store_location_city_types = ["Tier 1", "Tier 2", "Tier 3"]
store_types = [
"Supermarket Type2",
"Departmental Store",
"Supermarket Type1",
"Food Mart",
]
with tab1:
st.subheader("Predict Sales for Single Product")
# Numeric inputs
product_weight = st.number_input("Product Weight", min_value=0.0, step=0.01)
product_mrp = st.number_input("Product MRP", min_value=0.0, step=0.1)
product_allocated_area = st.number_input(
"Product Allocated Area (0 to 1)",
min_value=0.0,
max_value=1.0,
value=0.5,
step=0.01,
format="%.2f", # shows 2 decimal places
)
store_est_year = st.number_input(
"Store Establishment Year",
min_value=1900,
max_value=datetime.datetime.now().year,
step=1,
)
# Enumerated Kinda Input ---
product_sugar_content = st.selectbox(
"Product Sugar Content", product_sugar_contents
)
product_type = st.selectbox("Product Type", product_types)
store_size = st.selectbox("Store Size", store_sizes)
store_location_city_type = st.selectbox(
"Store Location City Type", store_location_city_types
)
store_type = st.selectbox("Store Type", store_types)
# Predict button
if st.button("Predict Sales"):
# Prepare data
data = {
"Product_Weight": product_weight,
"Product_MRP": product_mrp,
"Product_Allocated_Area": product_allocated_area,
"Store_Establishment_Year": store_est_year,
"Product_Sugar_Content": product_sugar_content,
"Product_Type": product_type,
"Store_Size": store_size,
"Store_Location_City_Type": store_location_city_type,
"Store_Type": store_type,
}
with st.spinner("Predicting..."):
# Make prediction request
try:
response = requests.post(f"{BASE_URL}/predict", json=data)
result = response.json()
if result["status"] == "success":
st.success(f"Predicted Sales: ₹{result['predicted_sales']:.2f}")
else:
st.error(f"Prediction failed: {result['message']}")
except Exception as e:
st.error(f"Error making prediction: {str(e)}")
with tab2:
st.subheader("Predict Sales for Multiple Products")
# File upload
uploaded_file = st.file_uploader("Upload CSV file", type=["csv"])
if uploaded_file is not None:
try:
# Read the file into a DataFrame
# Create a copy of the file data
#file_contents = uploaded_file.getvalue()
# # Use one copy for preview
# df = pd.read_csv(io.BytesIO(file_contents))
# st.write("Preview of uploaded data:")
# st.write(df.head())
# Preview the uploaded data
df = pd.read_csv(uploaded_file)
st.write("Preview of uploaded data:")
st.write(df.head())
# Check required columns
required_cols = [
"Product_Weight",
"Product_MRP",
"Product_Allocated_Area",
"Store_Establishment_Year",
"Product_Sugar_Content",
"Product_Type",
"Store_Size",
"Store_Location_City_Type",
"Store_Type",
]
missing_cols = [col for col in required_cols if col not in df.columns]
if missing_cols:
st.error(f"Missing required columns: {', '.join(missing_cols)}")
else:
# Make prediction button
if st.button("Predict Sales for All Products"):
with st.spinner("Predicting..."):
# Make prediction request
try:
# Reset file pointer and create a new file-like object
uploaded_file.seek(0) # as we read it earlier for preview
# Create a named temporary file-like object
files = {
"file": ("batch_predict.csv", uploaded_file, "text/csv")
}
# files = {"file": uploaded_file}
response = requests.post(
f"{BASE_URL}/predict_batch", files=files
)
result = response.json()
if result["status"] == "success":
# Add predictions to dataframe
df["Predicted_Sales"] = result["predictions"]
st.success("Predictions completed successfully!")
st.write("Results:")
st.write(df)
# Download button for results
csv = df.to_csv(index=False)
st.download_button(
"Download Results",
csv,
"predictions.csv",
"text/csv",
key="download-csv",
)
else:
st.error(f"Prediction failed: {result['message']}")
except Exception as e:
st.error(f"Error making prediction: {str(e)}")
except Exception as e:
st.error(f"Error reading file: {str(e)}")
Overwriting frontend/app.py
Requirements
%%writefile frontend/requirements.txt
streamlit
requests
pandas
Writing frontend/requirements.txt
Docker
%%writefile frontend/Dockerfile
FROM python:3.9-slim
WORKDIR /app
# Install system dependencies (needed by streamlit sometimes)
RUN apt-get update && apt-get install -y build-essential
COPY requirements.txt .
RUN pip install -r requirements.txt
# ! REQUIRED: otherwise file upload won't work
# Create .streamlit directory with write permissions
RUN mkdir -p /app/.streamlit && chmod 777 /app/.streamlit
# Copy all files from current directory to container's /app directory
COPY . .
# Write the .streamlit config (REQUIRED: otherwise file upload won't work)
RUN echo "\
[server]\n\
headless = true\n\
port = 7860\n\
enableCORS = false\n\
\n\
[browser]\n\
gatherUsageStats = false\n\
" > /app/.streamlit/config.toml
EXPOSE 7860
# Important: helps prevent an event-loop crash
# ! REQUIRED: otherwise file upload won't work
# (Dockerfile ENV lines don't support inline comments, so keep comments on their own lines)
ENV PYTHONASYNCIODEBUG=1
CMD ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]
# NOTE: Disable XSRF protection for easier external access in order to make batch predictions
Overwriting frontend/Dockerfile
Upload to Hugging Face Hub
# # HF Config (already logged in above)
# access_token = os.environ["HF_TOKEN"]
repo_id = "nipunshah/superkart-frontend"
# # Login to Hugging Face
# login(token=access_token)
# # Initialize API
# api = HfApi()
# Upload backend folder
api.upload_folder(
folder_path="frontend", # Our local frontend folder
repo_id=repo_id, # Your space name
repo_type="space" # Type is space
)
'https://huggingface.co/spaces/nipunshah/superkart-frontend/tree/main/'
Sample csv for batch prediction
os.makedirs("testset", exist_ok=True)
%%writefile testset/test_batch.csv
Product_Weight,Product_MRP,Product_Allocated_Area,Store_Establishment_Year,Product_Sugar_Content,Product_Type,Store_Size,Store_Location_City_Type,Store_Type
12.66,117.08,0.027,2009,Low Sugar,Frozen Foods,Medium,Tier 2,Supermarket Type2
16.54,171.43,0.144,1999,Regular,Dairy,Medium,Tier 1,Departmental Store
14.28,162.08,0.031,1987,Regular,Canned,High,Tier 2,Supermarket Type1
9.57,123.67,0.01,1998,No Sugar,Health and Hygiene,Small,Tier 3,Food Mart
13.92,116.89,0.099,2009,Low Sugar,Fruits and Vegetables,Medium,Tier 2,Supermarket Type2
Writing testset/test_batch.csv
def test_batch_inference():
# Base URL of our backend
base_url = "https://nipunshah-superkart-backend.hf.space"
# Read the sample CSV
#df = pd.read_csv("testset/test_batch.csv")
# Open the file and send as multipart/form-data
files = {
'file': open('testset/test_batch.csv', 'rb')
}
# Make POST request to batch prediction endpoint
response = requests.post(
f"{base_url}/predict_batch",
#json=df.to_dict(orient="records")
files=files
)
# Display results
if response.status_code == 200:
response = response.json()
predictions = response['predictions']
print("Predictions:", predictions)
# Save predictions to CSV
# predictions_df = pd.DataFrame(predictions)
# predictions_df.to_csv('predictions.csv', index=False)
# print("Predictions saved to predictions.csv")
else:
print("Error:", response.text)
test_batch_inference()
Predictions: [2876.456053269243, 4846.866048789477, 4139.998187436777, 2044.3859239158705, 3140.2883973846215]
✨ Hurray, Verified ✅
from IPython.display import Image, display, HTML
Image(filename='../screenshots/projectspecifics/frontend_single1.png')
Image(filename='../screenshots/projectspecifics/frontend_single2.png')
Image(filename='../screenshots/projectspecifics/frontend_batch_1.png')
Image(filename='../screenshots/projectspecifics/frontend_batch_2.png')
Recommendations: